
Canica and embedding neighbourhoods: what changes for data teams?


(@nhi-mgmt-group)

TL;DR: Text embeddings can be mapped into an interactive 2D space with t-SNE and UMAP, helping teams inspect semantic neighbourhoods, spot cross-language overlap, and rerun plots on focused subsets, according to Lakera. The real value is governance visibility: model quality work starts with understanding what the training data is actually doing.

NHIMG editorial — based on content published by Lakera: Releasing Canica, a text dataset viewer

Questions worth separating out

Q: How should teams use embedding visualisations when reviewing text datasets?

A: Use them as a diagnostic layer, not as proof of quality.

Q: Why do dimensionality reduction plots sometimes mislead reviewers?

A: They compress high-dimensional relationships into a simplified view that preserves local similarity better than global structure.
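One rough way to check how far a given plot can be trusted is to measure how many of each point's nearest neighbours in the original space are still its nearest neighbours in 2D. The sketch below is an illustration only, not part of canica or Lakera's workflow; the function names and the toy data are assumptions, and it uses plain Euclidean distance from the standard library.

```python
import math

def knn(points, i, k):
    """Indices of the k points closest to points[i] (Euclidean distance)."""
    ranked = sorted(
        (math.dist(points[i], p), j) for j, p in enumerate(points) if j != i
    )
    return {j for _, j in ranked[:k]}

def neighbour_overlap(high_dim, low_dim, k):
    """Mean share of each point's k nearest neighbours that survive projection."""
    if len(high_dim) != len(low_dim):
        raise ValueError("both views must describe the same points")
    shares = [
        len(knn(high_dim, i, k) & knn(low_dim, i, k)) / k
        for i in range(len(high_dim))
    ]
    return sum(shares) / len(shares)

# Toy data (hypothetical): two tight pairs in 3D, far apart from each other.
original = [(0, 0, 0), (0, 0, 1), (10, 10, 10), (10, 10, 11)]
faithful = [(0, 0), (0, 1), (10, 10), (10, 11)]    # pairs stay together
scrambled = [(0, 0), (10, 10), (0, 1), (10, 11)]   # pairs are split up

faithful_score = neighbour_overlap(original, faithful, k=1)
scrambled_score = neighbour_overlap(original, scrambled, k=1)
```

A score near 1 says local neighbourhoods survived the projection; it says nothing about whether distances between far-apart clusters are meaningful, which is the part these plots tend to distort.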

Q: When should security or data teams rerun a plot on a subset of data?

A: Rerun it when the full dataset hides the question you need answered, such as language differences, rare classes, or a small but important segment.
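As a minimal sketch of that rerun step, the helper below filters rows by metadata and hands only the surviving embeddings to a reducer. Everything here is hypothetical: `replot_subset`, the `lang` metadata key, and the `first_two_dims` stand-in reducer are illustration names, not canica's API. In practice the reducer would wrap whichever t-SNE or UMAP implementation the team already uses.

```python
def replot_subset(embeddings, metadata, keep, reduce_fn):
    """Run reduce_fn on the rows whose metadata passes keep().

    Returns (original_row_indices, coords) so that points in the new plot
    can be traced back to records in the full dataset.
    """
    indices = [i for i, meta in enumerate(metadata) if keep(meta)]
    if not indices:
        raise ValueError("subset is empty; nothing to plot")
    coords = reduce_fn([embeddings[i] for i in indices])
    return indices, coords

def first_two_dims(vectors):
    """Stand-in reducer for the example only; swap in a real t-SNE/UMAP call."""
    return [(v[0], v[1]) for v in vectors]

# Hypothetical toy dataset: three records with a language tag each.
embeddings = [(0.1, 0.9, 0.3), (0.8, 0.2, 0.5), (0.2, 0.7, 0.1)]
metadata = [{"lang": "de"}, {"lang": "en"}, {"lang": "de"}]

rows, coords = replot_subset(
    embeddings, metadata, keep=lambda m: m["lang"] == "de", reduce_fn=first_two_dims
)
```

Returning the original row indices alongside the coordinates is the design choice that matters: a subset plot is only useful for review if each point can be mapped back to its source record.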

Practitioner guidance

  • Validate suspicious clusters against source records: when a 2D plot shows an unexpected boundary, inspect the nearest neighbours in the original embedding space and compare them with the raw text and labels.
  • Review minority slices separately: rerun dimensionality reduction on narrow subsets such as one language, one product line, or one intent class.
  • Use embedding review as dataset quality evidence: record what the visual inspection showed, which outliers were accepted, and which anomalies were corrected before training or evaluation.
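The first point above, checking a suspicious point against its neighbours in the original embedding space, can be sketched as follows. This is an assumption-laden illustration rather than how canica does it: it uses cosine similarity over plain Python lists, and the record fields `text` and `label` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def nearest_records(query, embeddings, records, k):
    """The k records most similar to records[query], measured in the original space."""
    ranked = sorted(
        ((cosine(embeddings[query], e), j)
         for j, e in enumerate(embeddings) if j != query),
        reverse=True,
    )
    return [records[j] for _, j in ranked[:k]]

def label_disagreement(query, embeddings, records, k):
    """Share of the k nearest neighbours whose label differs from the query's."""
    neighbours = nearest_records(query, embeddings, records, k)
    differing = sum(1 for r in neighbours if r["label"] != records[query]["label"])
    return differing / k

# Hypothetical toy data: three similar texts, one of them labelled differently.
embeddings = [(1.0, 0.0, 0.0), (0.9, 0.1, 0.0), (0.8, 0.2, 0.0), (0.0, 1.0, 0.0)]
records = [
    {"text": "I was charged twice", "label": "billing"},
    {"text": "Double charge on my card", "label": "billing"},
    {"text": "Charged two times this month", "label": "refund"},
    {"text": "How do I reset my password", "label": "refund"},
]

closest = nearest_records(0, embeddings, records, k=1)
disagreement = label_disagreement(0, embeddings, records, k=2)
```

A high disagreement share around a point is a prompt to read the raw records, not a verdict: the neighbour may be mislabelled, or the two classes may genuinely overlap.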

What's in the full article

Lakera's full article covers the operational detail this post intentionally leaves for the source: interactive dataset exploration mechanics, plotting workflow, and the practical user experience of canica.

  • Hands-on examples of using t-SNE and UMAP with text embeddings in a working notebook
  • The exact workflow for clicking points, inspecting neighbours, and replotting subsets
  • The rationale behind the canica interface choices and the internal feedback that shaped them
  • The GitHub and PyPI distribution details for teams that want to install and test the tool

👉 Read Lakera's article on canica and text dataset inspection →

(@mr-nhi)
 

Embedding inspection is becoming a governance control, not just a data science convenience. When teams train or evaluate models on text corpora, the question is no longer whether the embedding pipeline works in the abstract. The question is whether reviewers can see enough of the semantic structure to challenge bad labels, noisy clusters, and hidden dataset bias before those issues become model behaviour. That makes interactive inspection a governance surface for AI programmes, not a side utility for analysts.

A few things that frame the scale:

  • Only 44% of developers are reported to follow security best practices for secrets management, exposing a significant developer behaviour gap, according to The State of Secrets in AppSec.
  • The average estimated time to remediate a leaked secret is 27 days, despite 75% of organisations expressing strong confidence in their secrets management capabilities.

A question worth separating out:

Q: What should teams document after reviewing text embeddings interactively?

A: Document the points inspected, the anomalies found, the subset used, and the decisions made about labels or exclusions. That creates a review trail that supports later audit, model debugging, and data governance discussions. Without that record, visual inspection stays informal and hard to defend.
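One possible shape for that record is a small structured object that can be serialised and stored next to the dataset version. The sketch below is only a suggestion: the class name, every field name, and the example values are hypothetical and are not taken from canica or from Lakera's article.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List

@dataclass
class EmbeddingReview:
    """One interactive review session over a dataset or subset (hypothetical schema)."""
    dataset: str
    subset: str                       # filter used for the plot, e.g. "lang=de"
    reviewer: str
    points_inspected: List[int] = field(default_factory=list)
    anomalies: List[str] = field(default_factory=list)
    decisions: List[Dict[str, str]] = field(default_factory=list)

    def decide(self, record_id, action, reason):
        """Log a label or exclusion decision for one record."""
        self.decisions.append(
            {"record_id": str(record_id), "action": action, "reason": reason}
        )

    def to_json(self):
        """Stable JSON form suitable for committing alongside the dataset."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)

# Example values are invented for illustration.
review = EmbeddingReview(dataset="support-tickets-v3", subset="lang=de", reviewer="analyst-1")
review.points_inspected.extend([12, 87])
review.anomalies.append("record 87 sits inside the billing cluster but is labelled refund")
review.decide(87, "relabel", "text describes a double charge")

exported = review.to_json()
```

Sorting keys in the JSON output keeps diffs between review sessions readable, which helps when the trail is later used for audit or model debugging.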

👉 Read our full editorial: Canica shows why text dataset viewers need better embedding context



   