Canica shows why text dataset viewers need better embedding context

By NHI Mgmt Group Editorial Team. Published 2026-04-24. Domain: Best Practices. Source: Lakera

TL;DR: Text embeddings can be mapped into an interactive 2D space with t-SNE and UMAP, helping teams inspect semantic neighbourhoods, spot cross-language overlap, and rerun plots on focused subsets, according to Lakera. The real value is governance visibility: model quality work starts with understanding what the training data is actually doing.

At a glance

What this is: Canica is an interactive text dataset viewer that surfaces embedding neighbourhoods and subset behaviour in 2D plots.

Why it matters: Data quality, model behaviour, and downstream security decisions all depend on whether teams can inspect what their embeddings group together.

👉 Read Lakera's article on canica and text dataset inspection

Context

Text embeddings can hide quality problems when teams only inspect aggregate model outputs. A dataset viewer like canica helps practitioners see whether semantically related items cluster as expected, whether outliers are meaningful, and whether a reduced 2D view is obscuring important structure in the original embedding space.

For IAM, NHI, and AI governance teams, the practical issue is not visualisation alone. The governance question is whether teams can explain what data influenced a model, how subsets differ, and where hidden structure might create blind spots in training or evaluation workflows.

Key questions

Q: How should teams use embedding visualisations when reviewing text datasets?

A: Use them as a diagnostic layer, not as proof of quality. Review the map for clusters, outliers, and unexpected overlap, then confirm anything important in the underlying text and nearest neighbours. The goal is to catch semantic problems early, before a model learns from a misleading or uneven dataset.

Q: Why do dimensionality reduction plots sometimes mislead reviewers?

A: They compress high-dimensional relationships into a simplified view that preserves local similarity better than global structure. That means a plot can make points look more separated or more connected than they really are. Teams should treat the visual as a clue and validate the original embedding relationships before drawing conclusions.

Q: When should security or data teams rerun a plot on a subset of data?

A: Rerun it when the full dataset hides the question you need answered, such as language differences, rare classes, or a small but important segment. A focused subset can reveal whether the apparent cluster is real or only an artifact of scale, which improves both analysis and governance decisions.

Q: What should teams document after reviewing text embeddings interactively?

A: Document the points inspected, the anomalies found, the subset used, and the decisions made about labels or exclusions. That creates a review trail that supports later audit, model debugging, and data governance discussions. Without that record, visual inspection stays informal and hard to defend.
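
A review trail like the one described in this answer can be as simple as one structured record per inspection session. The sketch below is only an illustration: the `EmbeddingReviewEntry` class, its field names, and the example values are assumptions made for this post, not a format defined by canica or by Lakera.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class EmbeddingReviewEntry:
    """One interactive review session (hypothetical schema)."""
    reviewer: str
    dataset: str
    subset_filter: str       # how the inspected slice was selected
    points_inspected: list   # record identifiers the reviewer opened
    anomalies: list = field(default_factory=list)
    decision: str = ""       # for example: relabel, exclude, accept


def to_log_line(entry):
    """Serialise an entry as one JSON line for an append-only review log."""
    return json.dumps(asdict(entry), sort_keys=True)


# Made-up example values, for illustration only.
entry = EmbeddingReviewEntry(
    reviewer="data-owner-1",
    dataset="support-tickets-v3",
    subset_filter="lang == 'fr'",
    points_inspected=["rec-104", "rec-221"],
    anomalies=["rec-221 sits in the billing cluster but is labelled support"],
    decision="relabel",
)
log_line = to_log_line(entry)
```

Writing one JSON line per session keeps the log easy to append to and easy to search later, which is the property an audit or debugging discussion usually needs.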

Technical breakdown

How t-SNE and UMAP turn embeddings into inspectable structure

t-SNE and UMAP are dimensionality-reduction methods that compress high-dimensional embeddings into a lower-dimensional visual map. They preserve local neighbourhood relationships better than global geometry, so points that sit close together on the map are usually semantically similar, while the distances themselves, especially between clusters, are not reliable measurements. That makes the methods useful for spotting clusters, overlaps, and anomalous items in large text datasets. The trade-off is that the 2D map is an approximation, so the plot should be read as an interpretive aid, not a ground-truth representation of the embedding space.

Practical implication: treat the plot as a diagnostic surface and verify any surprising cluster with the underlying records.
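
One hedged way to check how far a particular map can be trusted is to count, for each point, how many of its nearest neighbours in the original embedding space are still its nearest neighbours in 2D. The sketch below is not part of canica. The function names `knn_indices` and `neighbour_overlap`, the choice of Euclidean distance, and the toy vectors are all assumptions for illustration, and the 2D coordinates stand in for whatever t-SNE or UMAP would produce.

```python
import math


def knn_indices(points, index, k):
    """Indices of the k points closest to points[index], excluding itself."""
    origin = points[index]
    others = [i for i in range(len(points)) if i != index]
    others.sort(key=lambda i: math.dist(origin, points[i]))
    return others[:k]


def neighbour_overlap(high_dim, low_dim, k=2):
    """Per-point fraction of original-space neighbours that survive in the 2D map."""
    scores = []
    for i in range(len(high_dim)):
        original = set(knn_indices(high_dim, i, k))
        projected = set(knn_indices(low_dim, i, k))
        scores.append(len(original & projected) / k)
    return scores


# Toy data: two groups that are far apart on the third axis...
high_dim = [[0, 0, 0], [1, 0, 0], [0, 1.2, 0], [0, 0, 9], [1, 0, 9]]
# ...and a deliberately poor 2D layout that places them on top of each other.
low_dim = [[0, 0], [1, 0], [0, 1.2], [0.1, 0.1], [1.1, 0.1]]

scores = neighbour_overlap(high_dim, low_dim, k=2)
```

A score near 1.0 means the map kept that point's neighbourhood; a score near 0.0 marks a point whose on-screen neighbours are not its real neighbours and should be checked against the underlying records.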

Why local neighbourhood inspection changes dataset review

A 2D scatter plot alone can conceal why a point sits where it does. Neighbour inspection adds the missing context by showing the nearest items in the original embedding space when a point is selected. That lets reviewers test whether a cluster is coherent, whether a mislabeled item is actually semantically adjacent, and whether a model is learning a useful pattern or a shortcut. The value is operational, because it turns visualisation from presentation into investigation.

Practical implication: use neighbour inspection during dataset review, not after model failure, to catch semantic drift early.
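
The lookup behind neighbour inspection can be sketched in a few lines. This is a minimal illustration, not canica's implementation: cosine similarity is an assumed choice of metric, `nearest_neighbours` is a hypothetical helper, and the texts and three-dimensional vectors are invented for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_neighbours(texts, embeddings, selected, k=3):
    """Return (similarity, text) pairs for the k items most similar to the selected one."""
    query = embeddings[selected]
    scored = [
        (cosine_similarity(query, emb), texts[i])
        for i, emb in enumerate(embeddings)
        if i != selected
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Invented texts and embeddings, for illustration only.
texts = [
    "reset my password",
    "forgot my password",
    "changer mon mot de passe",
    "order a pizza",
]
embeddings = [
    [0.90, 0.10, 0.00],
    [0.80, 0.20, 0.10],
    [0.85, 0.10, 0.20],
    [0.00, 0.10, 0.90],
]

neighbours = nearest_neighbours(texts, embeddings, selected=0, k=2)
```

Because the similarity is computed on the original vectors rather than the 2D coordinates, the result shows what the embedding model considers close, including cross-language matches like the French sentence here.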

How focused subset reruns expose hidden structure

Re-running dimensionality reduction on a selected subset can reveal structure that disappears in the full corpus. This matters because large datasets can compress away minority patterns, language differences, or narrow domains that are operationally important. By isolating a subset and recomputing the map, reviewers can see whether the selected items form a coherent subpopulation or whether the apparent cluster was an artifact of scale. That makes the tool especially useful for debugging training sets and evaluating representation quality.

Practical implication: rerun plots on targeted slices when you need to validate minority classes, language segments, or edge cases.
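
The effect of a subset rerun can be shown with a small, self-contained sketch. To keep it dependency-free, it swaps in plain PCA (computed by power iteration) for t-SNE or UMAP, so it illustrates the principle rather than the tool's actual reduction step. The record layout, the `replot_subset` helper, and the toy vectors are all assumptions.

```python
import random


def centre(rows):
    """Subtract the per-dimension mean so axes describe variation within these rows."""
    dims = len(rows[0])
    means = [sum(r[d] for r in rows) / len(rows) for d in range(dims)]
    return [[r[d] - means[d] for d in range(dims)] for r in rows]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def principal_axes(rows, n_axes=2, iterations=200, seed=0):
    """Top principal directions via power iteration with deflation (plain PCA)."""
    rng = random.Random(seed)
    data = centre(rows)
    axes = []
    for _ in range(n_axes):
        v = [rng.uniform(-1.0, 1.0) for _ in data[0]]
        length = dot(v, v) ** 0.5
        v = [x / length for x in v]
        for _ in range(iterations):
            scores = [dot(r, v) for r in data]
            w = [sum(s * r[d] for s, r in zip(scores, data)) for d in range(len(v))]
            norm = dot(w, w) ** 0.5
            if norm == 0.0:
                break  # no variance left to explain
            v = [x / norm for x in w]
        axes.append(v)
        # Deflate: remove the direction just found before searching for the next.
        data = [[x - dot(r, v) * a for x, a in zip(r, v)] for r in data]
    return axes


def replot_subset(records, keep):
    """Recompute the 2D map using only the records selected by `keep`."""
    subset = [r for r in records if keep(r)]
    vectors = [r["embedding"] for r in subset]
    axes = principal_axes(vectors)
    coords = [[dot(row, axis) for axis in axes] for row in centre(vectors)]
    return subset, axes, coords


# Toy corpus: language dominates dimension 0; a French sub-topic lives in dimension 1.
records = [
    {"lang": "en", "embedding": [0.0, 0.0, 0.0]},
    {"lang": "en", "embedding": [0.1, 0.1, 0.0]},
    {"lang": "en", "embedding": [0.0, 0.2, 0.1]},
    {"lang": "fr", "embedding": [10.0, 0.0, 0.0]},
    {"lang": "fr", "embedding": [10.1, 0.1, 0.1]},
    {"lang": "fr", "embedding": [10.0, 1.0, 0.0]},
    {"lang": "fr", "embedding": [9.9, 1.1, 0.1]},
]

full_axes = principal_axes([r["embedding"] for r in records])
fr_subset, fr_axes, fr_coords = replot_subset(records, lambda r: r["lang"] == "fr")
```

On the full corpus the first axis is spent almost entirely on the language split. Once only the French records are kept, the first axis realigns with the sub-topic difference that the full view had squeezed into a single blob. Non-linear methods behave differently in detail, but the reason to rerun is the same.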

NHI Mgmt Group analysis

Embedding inspection is becoming a governance control, not just a data science convenience. When teams train or evaluate models on text corpora, the question is no longer whether the embedding pipeline works in the abstract. The question is whether reviewers can see enough of the semantic structure to challenge bad labels, noisy clusters, and hidden dataset bias before those issues become model behaviour. That makes interactive inspection a governance surface for AI programmes, not a side utility for analysts.

Local neighbourhoods matter more than global visuals when teams are trying to trust model inputs. A 2D plot can look tidy while still hiding semantically wrong neighbours in the original space. The practical lesson is that visual simplicity can create false confidence, especially when stakeholders treat a scatter plot as proof of quality. Practitioners should read dimensionality reduction as a hypothesis generator, not a validation artifact.

Subset replotting creates a sharper operational question: what patterns disappear at full scale? Large text corpora often flatten the very edge cases teams need to see, including language splits, niche intents, or anomalous records. Recomputing on a focused slice forces reviewers to ask whether the full dataset is smoothing over risk. That is a useful discipline for model builders, data owners, and AI governance leads.

Data quality tooling and AI governance now overlap at the point of review. The article sits in a broader shift where teams need more than training metrics to judge whether model inputs are safe and usable. Visual review, neighbour inspection, and slice-based exploration help translate abstract dataset quality into something practitioners can inspect and discuss. Teams that govern AI inputs should treat this kind of tooling as part of control evidence.

Canica highlights a problem we call semantic neighbourhood drift. When an embedding map shows points that look separated in 2D but remain close in the original space, reviewers can misread the dataset and miss important relationships. That gap between visual impression and embedding reality is exactly where review process failures start. Practitioners should use neighbour-based validation whenever a map suggests a boundary they care about.

From our research:
Only 44% of developers are reported to follow security best practices for secrets management, exposing a significant developer behaviour gap, according to The State of Secrets in AppSec.
The average estimated time to remediate a leaked secret is 27 days, despite 75% of organisations expressing strong confidence in their secrets management capabilities.
That gap is why the NHI Lifecycle Management Guide matters when teams need to connect review, rotation, and offboarding into one operating model.

What this signals

Semantic neighbourhood drift: AI programmes need review tooling that shows where visual shortcuts diverge from embedding reality. As models and datasets grow, teams should expect more cases where a neat 2D map hides a risky relationship in the original space.

The broader signal is that dataset inspection is becoming a control point for AI governance, not a post-training curiosity. Teams that manage model inputs, secrets, or workload identity should build review workflows that combine visualisation, neighbour inspection, and traceable decision logging.

Only 44% of developers follow security best practices for secrets management, according to The State of Secrets in AppSec, which shows how often operational discipline lags behind confidence. That same pattern appears in AI data review when teams assume a view is enough without checking the underlying evidence.

For practitioners

Validate suspicious clusters against source records: When a 2D plot shows an unexpected boundary, inspect the nearest neighbours in the original embedding space and compare them to the raw text labels. Do not accept the visual cluster as evidence until the underlying records support it.
Review minority slices separately: Rerun dimensionality reduction on narrow subsets such as one language, one product line, or one intent class. This helps reveal structure that the full corpus can flatten and makes it easier to judge whether the subset is coherent.
Use embedding review as dataset quality evidence: Record what the visual inspection showed, which outliers were accepted, and which anomalies were corrected before training or evaluation. That turns informal review into traceable quality evidence for model governance.
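
The first recommendation, comparing a record's label with those of its nearest neighbours, can be partly automated as a triage step. The sketch below is one possible approach under stated assumptions: `flag_label_disagreements` is a hypothetical helper, Euclidean distance and a simple majority vote are illustrative choices, and the labels and coordinates are made up. Flagged records are candidates for human review, not confirmed errors.

```python
import math
from collections import Counter


def flag_label_disagreements(labels, embeddings, k=3):
    """Indices whose label differs from the majority label of their k nearest neighbours."""
    flagged = []
    for i, query in enumerate(embeddings):
        others = [j for j in range(len(embeddings)) if j != i]
        others.sort(key=lambda j: math.dist(query, embeddings[j]))
        votes = Counter(labels[j] for j in others[:k])
        majority_label, _ = votes.most_common(1)[0]
        if majority_label != labels[i]:
            flagged.append(i)
    return flagged


# Made-up data: record 3 sits inside the billing group but carries a support label.
labels = ["billing", "billing", "billing", "support",
          "support", "support", "support", "support"]
embeddings = [
    [0.0, 0.0], [0.2, 0.0], [0.0, 0.3], [0.2, 0.3],
    [5.0, 5.0], [5.2, 5.0], [5.0, 5.3], [5.2, 5.3],
]

to_review = flag_label_disagreements(labels, embeddings, k=3)
```

Using an odd `k` with two labels avoids tied votes; with more classes, a tie-breaking rule would need to be chosen explicitly.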

Key takeaways

Canica matters because interactive embedding review helps teams catch semantic errors that static plots can hide.
The main governance risk is false confidence, where a tidy 2D map obscures what the original embedding space is actually saying.
Practitioners should pair visualisation with neighbour inspection, subset reruns, and documented review decisions.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

The OWASP Agentic AI Top 10 addresses the attack and risk surface, while NIST AI RMF and NIST CSF 2.0 set out the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
NIST AI RMF		Dataset review supports AI governance over input quality and model traceability.
NIST CSF 2.0	GV.OV-01	Governance oversight fits review workflows that need traceable quality decisions.
OWASP Agentic AI Top 10		Agentic and AI systems depend on trustworthy inputs and observable behaviour.

Use AI RMF governance practices to document dataset review and validate training inputs before model use.

Key terms

Dimensionality Reduction: A technique that compresses high-dimensional data into fewer dimensions so people can inspect patterns visually. In practice, it helps teams spot clusters and outliers, but it can also distort global structure, so it should be used as an aid to investigation rather than as proof of meaning.
Embedding Neighbourhood: The group of items closest to a point in embedding space, usually indicating semantic similarity. For practitioners, neighbourhoods are useful because they show whether a record belongs with its apparent peers, which helps validate labels, surface anomalies, and challenge false visual assumptions.
Subset Replotting: Recomputing a reduced-dimension view on a selected slice of the data instead of the full corpus. This is valuable when large datasets hide minority patterns or edge cases, because the focused view can reveal structure that otherwise disappears in aggregate.

What's in the full article

Lakera's full article covers the operational detail this post intentionally leaves for the source: interactive dataset exploration mechanics, plotting workflow, and the practical user experience of canica.

Hands-on examples of using t-SNE and UMAP with text embeddings in a working notebook
The exact workflow for clicking points, inspecting neighbours, and replotting subsets
The rationale behind the canica interface choices and the internal feedback that shaped them
The GitHub and PyPI distribution details for teams that want to install and test the tool

👉 Lakera's full post covers the embedding workflow, neighbour exploration, and subset replotting in more detail.


NHIMG Editorial Note
Published by the NHIMG editorial team on 2026-04-24.
NHI Mgmt Group — the independent authority on Non-Human Identity, IAM, and Agentic AI security. nhimg.org