TL;DR: Cyera argues that legacy classification tools cannot keep pace with cloud and SaaS data sprawl, and that LLMs, clustering, and learned intelligence can move security from pattern matching to contextual understanding. The deeper shift is that data security now depends on interpreting meaning, business relevance, and exposure, not just finding known strings.
NHIMG editorial — based on content published by Cyera: Understanding Data in Context, an LLM-driven approach to data classification
By the numbers:
- Cyera has found that about 86% of an organisation’s data is unique to its environment.
- 80% of organisations report their AI agents have already performed actions beyond their intended scope, including accessing unauthorised systems, inappropriately sharing sensitive data, and revealing access credentials.
Questions worth separating out
Q: How should security teams classify data in cloud and SaaS environments?
A: Security teams should combine deterministic pattern matching with contextual methods that understand meaning, relationships, and business use.
Q: Why do traditional data classification tools fail at scale?
A: Traditional tools fail because they are built to recognise patterns, not interpret context.
Q: How do teams know if contextual classification is working?
A: It is working when findings become more actionable, false positives drop, and policy decisions match business sensitivity instead of generic labels.
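One way to make "false positives drop" measurable is to track precision per detector on a manually triaged sample of findings. The sketch below is illustrative only: the detector names and the triage sample are hypothetical, and it assumes each finding has already been labelled as a true or false positive by a reviewer.

```python
from collections import defaultdict


def precision_by_detector(findings):
    """Return {detector: precision} from (detector, is_true_positive) pairs.

    `findings` is assumed to come from manual triage of a sample of alerts.
    """
    counts = defaultdict(lambda: [0, 0])  # detector -> [true positives, total]
    for detector, is_true_positive in findings:
        counts[detector][1] += 1
        if is_true_positive:
            counts[detector][0] += 1
    return {name: tp / total for name, (tp, total) in counts.items()}


# Hypothetical triage sample: the same pattern with and without a context check.
sample = [
    ("ssn_regex", True),
    ("ssn_regex", False),
    ("ssn_regex", False),
    ("ssn_regex", False),
    ("ssn_regex+context", True),
    ("ssn_regex+context", True),
    ("ssn_regex+context", True),
    ("ssn_regex+context", False),
]

result = precision_by_detector(sample)
for name, precision in sorted(result.items()):
    print(f"{name}: precision={precision:.2f}")
```

Comparing the same figure before and after a contextual layer is introduced gives a simple trend line. It does not capture missed findings, so it is best read alongside a recall check on known-sensitive data.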
Practitioner guidance
- Audit classification failure modes first. Map where regex, keyword, and label-based methods generate the most false positives, especially in cloud and SaaS repositories with heterogeneous data.
- Use LLMs as a verification layer. Place LLM validation after initial detection so the model confirms whether a match is actually sensitive in context before it enters a remediation queue.
- Create context-based data tiers. Classify datasets by business meaning, not only by file type, so access policy and remediation priorities reflect how the organisation actually uses the data.
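The verification-layer idea in the guidance above can be sketched as a two-stage pipeline: a deterministic detector produces candidates, and a second stage decides which of them reach the remediation queue. This is a minimal sketch under stated assumptions, not Cyera's implementation. The names `Candidate`, `detect`, `verify`, and `stub_verifier` are hypothetical, and `stub_verifier` is a keyword heuristic standing in for a real LLM call, which would need to be supplied separately.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Stage 1 pattern: US SSN-shaped strings (illustrative only).
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


@dataclass
class Candidate:
    doc_id: str
    match: str
    context: str  # text surrounding the match, passed to the verifier


def detect(doc_id: str, text: str, window: int = 40) -> List[Candidate]:
    """Stage 1: deterministic pattern matching with surrounding context."""
    candidates = []
    for m in SSN_PATTERN.finditer(text):
        start = max(0, m.start() - window)
        end = min(len(text), m.end() + window)
        candidates.append(Candidate(doc_id, m.group(), text[start:end]))
    return candidates


def verify(
    candidates: Iterable[Candidate],
    is_sensitive: Callable[[Candidate], bool],
) -> List[Candidate]:
    """Stage 2: keep only candidates the verifier confirms as sensitive."""
    return [c for c in candidates if is_sensitive(c)]


def stub_verifier(candidate: Candidate) -> bool:
    """Hypothetical stand-in for an LLM judgement (assumption, not a real API).

    A real verifier would send candidate.context to a model and parse its answer.
    """
    lowered = candidate.context.lower()
    return not any(term in lowered for term in ("part number", "order id", "sku"))


# Hypothetical documents: one real identifier, one look-alike.
docs = {
    "hr.txt": "Employee SSN: 123-45-6789 on file.",
    "catalog.txt": "Part number 987-65-4321 ships Monday.",
}

queue: List[Candidate] = []
for doc_id, text in docs.items():
    queue.extend(verify(detect(doc_id, text), stub_verifier))

for item in queue:
    print(item.doc_id, item.match)
```

Passing the verifier in as a callable keeps the cheap pattern stage separate from the model-backed stage, so the verifier can be swapped or rate-limited without touching detection.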
What's in the full article
Cyera's full article covers the operational detail this post intentionally leaves for the source:
- The layered classification workflow that combines clustering, semantic distancing, and LLM validation.
- The operational trade-offs between precision, speed, and cost when classifying large unstructured datasets.
- Examples of how learned classification handles proprietary business data that never matches public taxonomies.
- The practical framing for moving from visibility to action across data security workflows.
👉 Read Cyera's analysis of LLM-driven data classification in modern environments →
LLM-driven data classification: what it means for data governance?
Explore further