Canica and embedding neighbourhoods: what changes for data teams?

NHI Mgmt Group

(@nhi-mgmt-group)

Member Moderator

Joined: 1 year ago

Posts: 12387

Topic starter 05/07/2026 6:44 pm

TL;DR: Text embeddings can be mapped into an interactive 2D space with t-SNE and UMAP, helping teams inspect semantic neighbourhoods, spot cross-language overlap, and rerun plots on focused subsets, according to Lakera. The real value is governance visibility: model quality work starts with understanding what the training data is actually doing.

NHIMG editorial — based on content published by Lakera: Releasing Canica, a text dataset viewer

Questions worth separating out

Q: How should teams use embedding visualisations when reviewing text datasets?

A: Use them as a diagnostic layer, not as proof of quality.

Q: Why do dimensionality reduction plots sometimes mislead reviewers?

A: They compress high-dimensional relationships into a simplified view that preserves local similarity better than global structure.

Q: When should security or data teams rerun a plot on a subset of data?

A: Rerun it when the full dataset hides the question you need answered, such as language differences, rare classes, or a small but important segment.
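One way to make the point about local versus global structure concrete is to measure how many of each point's nearest neighbours survive the projection. The sketch below is an illustration only, not canica's API and not Lakera's method: the function names `knn` and `neighbour_overlap` and the toy coordinates are assumptions, and a real check would run on actual embeddings and their t-SNE or UMAP output.

```python
import math

def knn(points, i, k):
    """Indices of the k points closest to points[i] by Euclidean distance."""
    dists = [(math.dist(points[i], p), j) for j, p in enumerate(points) if j != i]
    return {j for _, j in sorted(dists)[:k]}

def neighbour_overlap(high_dim, low_dim, k=2):
    """Mean fraction of each point's k nearest neighbours that the 2D layout keeps."""
    n = len(high_dim)
    total = 0.0
    for i in range(n):
        total += len(knn(high_dim, i, k) & knn(low_dim, i, k)) / k
    return total / n

# Toy data (hypothetical): two tight pairs in the original space.
high = [(0, 0, 0), (1, 0, 0), (10, 0, 0), (11, 0, 0)]
faithful = [(0, 0), (1, 0), (10, 0), (11, 0)]    # pairs stay together
distorted = [(0, 0), (10, 0), (1, 0), (11, 0)]   # pairs are split apart

print(neighbour_overlap(high, faithful, k=1))   # 1.0
print(neighbour_overlap(high, distorted, k=1))  # 0.0
```

A low score on a region of the plot is a signal to go back to the original embedding space before drawing conclusions from the picture.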

Practitioner guidance

Validate suspicious clusters against source records: When a 2D plot shows an unexpected boundary, inspect the nearest neighbours in the original embedding space and compare them to the raw text labels.
Review minority slices separately: Rerun dimensionality reduction on narrow subsets such as one language, one product line, or one intent class.
Use embedding review as dataset quality evidence: Record what the visual inspection showed, which outliers were accepted, and which anomalies were corrected before training or evaluation.
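The first two guidance points can be sketched in a few lines: rank neighbours by cosine similarity in the original embedding space, and optionally restrict the candidates to one slice. This is a minimal illustration under stated assumptions, not canica's interface: the record fields (`text`, `label`, `language`, `embedding`), the function `nearest_records`, and the sample data are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def nearest_records(records, query_index, k=3, language=None):
    """Rank other records by similarity to records[query_index] in the original
    embedding space, optionally keeping only one language slice."""
    query = records[query_index]
    candidates = [
        (cosine(query["embedding"], r["embedding"]), r)
        for i, r in enumerate(records)
        if i != query_index and (language is None or r["language"] == language)
    ]
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return [(round(score, 3), r["text"], r["label"]) for score, r in candidates[:k]]

# Hypothetical records; real embeddings would have hundreds of dimensions.
records = [
    {"text": "reset my password", "label": "account", "language": "en", "embedding": [1.0, 0.0, 0.0]},
    {"text": "forgot password", "label": "account", "language": "en", "embedding": [0.9, 0.1, 0.0]},
    {"text": "mot de passe oublié", "label": "billing", "language": "fr", "embedding": [0.95, 0.05, 0.0]},
    {"text": "refund my order", "label": "billing", "language": "en", "embedding": [0.0, 1.0, 0.0]},
]

all_slices = nearest_records(records, 0, k=2)               # closest match carries a different label
en_only = nearest_records(records, 0, k=2, language="en")   # same query, English slice only
```

In this toy set the closest neighbour of the password query is labelled "billing", which is the kind of mismatch a reviewer would want to check against the source record.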

What's in the full article

Lakera's full article covers the operational detail this post intentionally leaves for the source: interactive dataset exploration mechanics, plotting workflow, and the practical user experience of canica.

Hands-on examples of using t-SNE and UMAP with text embeddings in a working notebook
The exact workflow for clicking points, inspecting neighbours, and replotting subsets
The rationale behind the canica interface choices and the internal feedback that shaped them
The GitHub and PyPI distribution details for teams that want to install and test the tool

👉 Read Lakera's article on canica and text dataset inspection →

Mr NHI

(@mr-nhi)

Member Moderator

Joined: 3 months ago

Posts: 11961

05/07/2026 7:00 pm

Embedding inspection is becoming a governance control, not just a data science convenience. When teams train or evaluate models on text corpora, the question is no longer whether the embedding pipeline works in the abstract. The question is whether reviewers can see enough of the semantic structure to challenge bad labels, noisy clusters, and hidden dataset bias before those issues become model behaviour. That makes interactive inspection a governance surface for AI programmes, not a side utility for analysts.

A few things that frame the scale:

Only 44% of developers are reported to follow security best practices for secrets management, exposing a significant developer behaviour gap, according to The State of Secrets in AppSec.
The average estimated time to remediate a leaked secret is 27 days, despite 75% of organisations expressing strong confidence in their secrets management capabilities.

A question worth separating out:

Q: What should teams document after reviewing text embeddings interactively?

A: Document the points inspected, the anomalies found, the subset used, and the decisions made about labels or exclusions. That creates a review trail that supports later audit, model debugging, and data governance discussions. Without that record, visual inspection stays informal and hard to defend.
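As a rough sketch of what such a review trail could look like, the record below captures the four items named in the answer and serialises them to JSON so they can be stored next to the dataset version. The class name `EmbeddingReview`, its fields, and the sample values are assumptions for illustration, not a format defined by Lakera or canica.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class EmbeddingReview:
    """One interactive review session over a dataset or subset (hypothetical schema)."""
    dataset: str
    subset: str
    reviewer: str
    points_inspected: list = field(default_factory=list)
    anomalies: list = field(default_factory=list)
    decisions: list = field(default_factory=list)

    def to_json(self):
        # sort_keys keeps the output stable, which makes diffs between reviews readable
        return json.dumps(asdict(self), indent=2, sort_keys=True)

# Hypothetical session: names and identifiers are placeholders.
review = EmbeddingReview(dataset="support-tickets", subset="language == 'fr'", reviewer="analyst-1")
review.points_inspected.append("record-2")
review.anomalies.append({"record": "record-2", "issue": "label disagrees with nearest neighbours"})
review.decisions.append({"record": "record-2", "action": "relabel", "new_label": "account"})

print(review.to_json())
```

Keeping the subset expression in the record matters because a replot on a narrow slice can show different neighbours from the full-dataset view.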

👉 Read our full editorial: Canica shows why text dataset viewers need better embedding context
