AI data security needs DSPM built for unstructured model flows

By NHI Mgmt Group Editorial Team | Published 2026-05-29 | Domain: Governance & Risk | Source: Orca Security

TL;DR: Legacy data security tools cannot track how sensitive data moves through vector embeddings, RAG corpora, prompt logs, and model weights, leaving AI pipelines exposed to irreversible leakage, according to Orca Security. The governance shift is from post-storage discovery to pre-training control, because once data is embedded, conventional remediation no longer works.

At a glance

What this is: Orca Security argues that traditional DSPM fails for AI because unstructured data flows, shadow AI, and model-weight exposure require purpose-built discovery, lineage, access control, and remediation.

Why it matters: IAM, NHI, and AI governance teams need the same visibility and control model across training data, inference endpoints, and machine identities if they want to prevent irreversible exposure and prove compliance.

By the numbers:

According to Gartner, more than 55% of organizations have deployed or are piloting generative AI tools.


Context

AI data security is no longer just a data classification problem. Once sensitive content enters embeddings, RAG corpora, prompt logs, or model weights, the governance problem changes from finding data to controlling how that data can be transformed, recombined, and exposed across the AI lifecycle.

Orca Security’s framing is that legacy DSPM was built for structured stores and static repositories, while AI needs controls that see shadow AI, multi-cloud pipelines, and inference-time exposure. For identity teams, that means the boundary between data governance and access governance is now much thinner than most programmes assume.

Key questions

Q: How should security teams govern sensitive data in AI training and inference pipelines?

A: Security teams should treat AI data governance as a lifecycle control, not a point-in-time scan. Classify data before it enters training, track lineage through embeddings and inference, and enforce least privilege on the identities that move data through those stages. The goal is to prevent irreversible exposure before model weights make remediation expensive or impossible.

Q: Why do traditional DSPM tools fall short for AI workloads?

A: Traditional DSPM tools were built around structured databases and file stores, so they do not fully account for embeddings, prompt logs, RAG corpora, or model weights. AI data becomes risky when it is transformed, combined, or memorised, which means visibility alone is not enough. Teams need controls that understand the AI lifecycle and intervene earlier.

Q: What do security teams get wrong about shadow AI?

A: They often treat shadow AI as a usage-policy issue when it is really an unmanaged data access problem. Employees and developers can move sensitive content into copilots, external LLMs, and fine-tuning pipelines without any corresponding identity review. If access is not governed at the data layer, policy exceptions become data exposure.

Q: How can organisations prove AI data governance for auditors and regulators?

A: Use continuous evidence rather than periodic documentation. Maintain classification reports, lineage maps, access logs, and remediation records that show what data entered the AI system, who could access it, and what was blocked or removed. That evidence supports EU AI Act and NIST AI RMF expectations without relying on manual reconstruction.

Technical breakdown

Why legacy DSPM breaks on vector embeddings and model weights

Traditional DSPM tools are designed for known databases, file stores, and perimeter-oriented discovery. AI systems move data through embeddings, tokenisation, fine-tuning datasets, and prompt-response logs, which means the sensitive object is often transformed before it is ever stored in a conventional repository. Once data is embedded in model weights, selective removal is not realistic in the way it is for a file or table row. That makes intervention timing the real problem, not just visibility. AI-aware DSPM has to understand where the data is, how it changes form, and when the window for control closes.

Practical implication: Security teams need controls that act before training or fine-tuning consumes sensitive data, not after model exposure has become persistent.
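
As an illustration of a control that acts before training, the sketch below holds back any record that matches a sensitive-data pattern so it never reaches a fine-tuning job. It is a minimal example under stated assumptions, not Orca Security's method: the function names (classify, pre_training_gate), the GateResult type, and the three regular expressions are hypothetical, and a production classifier would need far broader coverage than pattern matching.

```python
import re
from dataclasses import dataclass

# Illustrative patterns only (assumption): a real classifier covers many more data types.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}


@dataclass
class GateResult:
    allowed: list  # records cleared for training
    blocked: list  # (record, matched labels) pairs held back for review


def classify(text):
    """Return the sorted labels of every pattern found in the text."""
    return sorted(label for label, rx in PATTERNS.items() if rx.search(text))


def pre_training_gate(records):
    """Split records into those cleared for training and those that are blocked."""
    result = GateResult(allowed=[], blocked=[])
    for record in records:
        labels = classify(record)
        if labels:
            result.blocked.append((record, labels))
        else:
            result.allowed.append(record)
    return result


if __name__ == "__main__":
    sample = [
        "Reset instructions were sent to the customer.",
        "Contact jane.doe@example.com about the invoice.",
        "Employee SSN 123-45-6789 is on file.",
    ]
    outcome = pre_training_gate(sample)
    print(len(outcome.allowed), "allowed;", len(outcome.blocked), "blocked")
```

The design point is the placement of the check rather than the patterns themselves: the gate sits between ingestion and the training job, so a blocked record can still be reviewed or redacted while conventional remediation is possible.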

Data lineage and toxic risk combinations in AI workflows

Data lineage in AI is the end-to-end record of how information moves from ingestion through preprocessing, embedding, training, and inference. The key risk is that individually acceptable datasets can become unsafe when combined, especially in RAG retrieval and cross-dataset enrichment. A dataset that is anonymised in one context can become re-identifiable when joined with a second source. Lineage also supports incident response because teams can identify which models, training runs, and outputs are affected without defaulting to broad retraining. In AI environments, lineage is both a governance artifact and an operational containment tool.

Practical implication: Teams should map lineage deeply enough to answer which models consumed which data, and which outputs inherit the resulting exposure.
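
One way to picture lineage as a containment tool is a directed graph in which an edge means "was derived from", walked downstream from the dataset in question. The sketch below is a simplified illustration: the LineageGraph class and the artifact names are hypothetical, and real lineage systems also record transformations, timestamps, and the identities that ran each step.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Directed graph: an edge source -> derived means `derived` was built from `source`."""

    def __init__(self):
        self._downstream = defaultdict(set)

    def record(self, source, derived):
        """Register that `derived` consumed `source`."""
        self._downstream[source].add(derived)

    def affected_by(self, artifact):
        """Return every artifact reachable downstream of `artifact` (breadth-first walk)."""
        seen = set()
        queue = deque([artifact])
        while queue:
            current = queue.popleft()
            for child in self._downstream.get(current, ()):
                if child not in seen:
                    seen.add(child)
                    queue.append(child)
        return seen


if __name__ == "__main__":
    graph = LineageGraph()
    # Hypothetical artifacts, for illustration only.
    graph.record("crm_export.csv", "support_embeddings_v3")
    graph.record("support_embeddings_v3", "rag_index_prod")
    graph.record("crm_export.csv", "finetune_run_17")
    graph.record("finetune_run_17", "support_model_v2")
    graph.record("public_docs", "rag_index_prod")
    print(sorted(graph.affected_by("crm_export.csv")))
```

If crm_export.csv turns out to contain regulated content, the downstream set is the containment scope: only the embeddings, index, training run, and model derived from it need attention, not every model in the estate.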

Shadow AI and access controls across AI workflows

Shadow AI is the unmanaged use of LLMs, copilots, and fine-tuning pipelines outside security review. The access problem is not just who can log in, but which identities can feed data into training systems or query inference endpoints with sensitive content. Because developers often use multiple AI tools across clouds and SaaS services, access control has to be granular and continuous rather than attached to a single platform. This is where AI data governance intersects with IAM and NHI governance: the same identity may be allowed to use a tool but not to move regulated data through it. Without that distinction, policy enforcement becomes inconsistent.

Practical implication: Govern access at the data layer, and review identity entitlements for AI tools as part of the same control plane as NHI access.
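
The distinction between being allowed to use a tool and being allowed to move regulated data through it can be expressed as two separate checks in one decision. The sketch below is a hypothetical policy function, not a real product API: the Identity type, the decide function, and the tool and data-class names are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Identity:
    name: str
    kind: str                               # e.g. "human", "service_account", "ai_agent"
    tools: frozenset = frozenset()          # AI tools this identity may use at all
    data_classes: frozenset = frozenset()   # data classifications it may move


def decide(identity, tool, data_class):
    """Return (allowed, reason). Tool entitlement and data entitlement are checked separately."""
    if tool not in identity.tools:
        return False, f"{identity.name} is not entitled to use {tool}"
    if data_class not in identity.data_classes:
        return False, f"{identity.name} may use {tool} but not with {data_class} data"
    return True, "allowed"


if __name__ == "__main__":
    # Hypothetical pipeline identity: may run fine-tuning, but only on public data.
    etl_bot = Identity(
        name="etl-bot",
        kind="service_account",
        tools=frozenset({"finetune-pipeline"}),
        data_classes=frozenset({"public"}),
    )
    print(decide(etl_bot, "finetune-pipeline", "public"))
    print(decide(etl_bot, "finetune-pipeline", "phi"))
    print(decide(etl_bot, "external-llm", "public"))
```

Keeping the two checks separate means the same decision applies to human users and non-human identities alike, and each denial carries a reason that can be logged as evidence.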

DeepSeek breach: 1M+ log lines and sensitive secret keys exposed.
New York Times breach: source code and credentials exposed via GitHub.

Read our 52 NHI Breaches Analysis report for a comprehensive view of breaches impacting Non-Human Identities including AI Agents.

NHI Mgmt Group analysis

AI data governance is now an identity problem as much as a storage problem. Once sensitive content enters AI workflows, the question is no longer only where the data resides but which identities can move it, transform it, and expose it. That spans human users, service accounts, copilots, and pipeline identities, which means conventional perimeter-oriented data controls miss the governance surface. Practitioners need a control model that treats AI data movement as an access decision, not just a classification event.

Unstructured AI data creates a governance gap that traditional DSPM was not built to close. Embeddings, prompt logs, RAG corpora, and model weights do not behave like rows and files, so the old assumption that sensitive data can be neatly catalogued before exposure no longer holds. The implication is that security programmes must reframe AI data risk around transformation boundaries and persistence, not just discovery coverage. That is a structural change in how data security and identity teams coordinate.

Toxic risk combinations are the right named concept for AI data exposure. The article’s core warning is that safe-looking datasets can become unsafe when they are combined in retrieval or training pipelines, especially where re-identification or memorisation becomes possible. This is not a simple DLP problem. Practitioners should treat dataset composition, not just dataset content, as the unit of governance.
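
Composition risk can be made concrete with a k-anonymity style check: measure the smallest group of records that share the same quasi-identifier values, first in each dataset alone and then in their join. The sketch below is one possible way to test for a toxic combination, not the method described by Orca Security. The function names, the pid join key, and the sample rows are hypothetical.

```python
from collections import Counter


def min_group_size(rows, quasi_identifiers):
    """Smallest number of rows sharing the same quasi-identifier values (the k in k-anonymity)."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values()) if groups else 0


def join(left, right, key):
    """Inner join of two lists of dicts on a shared key."""
    index = {}
    for rrow in right:
        index.setdefault(rrow[key], []).append(rrow)
    return [{**lrow, **rrow} for lrow in left for rrow in index.get(lrow[key], [])]


def is_toxic_combination(left, right, key, quasi_identifiers, k):
    """True if the joined data contains a group smaller than k on the combined quasi-identifiers."""
    joined = join(left, right, key)
    return bool(joined) and min_group_size(joined, quasi_identifiers) < k


if __name__ == "__main__":
    # Hypothetical pseudonymised datasets; each has groups of two on its own quasi-identifier.
    visits = [
        {"pid": 1, "zip": "10001", "diagnosis": "A"},
        {"pid": 2, "zip": "10001", "diagnosis": "B"},
        {"pid": 3, "zip": "10002", "diagnosis": "A"},
        {"pid": 4, "zip": "10002", "diagnosis": "C"},
    ]
    profiles = [
        {"pid": 1, "birth_year": 1980},
        {"pid": 2, "birth_year": 1991},
        {"pid": 3, "birth_year": 1980},
        {"pid": 4, "birth_year": 1991},
    ]
    print("visits k =", min_group_size(visits, ["zip"]))
    print("profiles k =", min_group_size(profiles, ["birth_year"]))
    print("toxic when joined:", is_toxic_combination(visits, profiles, "pid", ["zip", "birth_year"], k=2))
```

In this toy data every ZIP code and every birth year is shared by two people, yet each ZIP and birth year pair is unique once the datasets are joined, which is the sense in which composition rather than content becomes the unit of governance.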

Continuous compliance evidence matters because AI changes faster than manual assurance cycles. The governance problem is not only whether sensitive data was approved, but whether teams can prove what entered training, what was blocked, and what was remediated as models changed. That shifts compliance from periodic documentation to runtime evidence generation. Practitioners need evidence that follows the AI lifecycle, not a retrospective file assembled after the fact.
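
Runtime evidence is more convincing to an auditor when it is tamper-evident. One common technique is a hash chain, where each log entry stores the SHA-256 digest of its own content plus the digest of the previous entry. The sketch below is a minimal illustration under stated assumptions: the EvidenceLog class and the event fields are hypothetical, and a real deployment would also need durable storage and protection of the latest hash.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(body):
    """SHA-256 of the entry body, serialised with sorted keys so the hash is reproducible."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


class EvidenceLog:
    """Append-only log in which every entry commits to the one before it."""

    def __init__(self):
        self.entries = []

    def append(self, event, timestamp):
        prev = self.entries[-1]["hash"] if self.entries else GENESIS
        body = {"event": event, "timestamp": timestamp, "prev": prev}
        self.entries.append({**body, "hash": _digest(body)})

    def verify(self):
        """Recompute every hash and link; return False if any entry was altered or reordered."""
        prev = GENESIS
        for entry in self.entries:
            body = {k: entry[k] for k in ("event", "timestamp", "prev")}
            if entry["prev"] != prev or entry["hash"] != _digest(body):
                return False
            prev = entry["hash"]
        return True


if __name__ == "__main__":
    log = EvidenceLog()
    # Hypothetical events, for illustration only.
    log.append({"control": "classification", "dataset": "crm_export.csv", "decision": "blocked"},
               "2026-05-29T10:00:00Z")
    log.append({"control": "access", "identity": "etl-bot", "decision": "denied"},
               "2026-05-29T10:05:00Z")
    print("chain intact:", log.verify())
```

Because each entry depends on its predecessor, editing or deleting an earlier record breaks verification for everything after it, which is what lets the log stand as evidence rather than as a retrospective reconstruction.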

From our research:
The average organisation believes more than 1 in 5 of its non-human identities are insufficiently secured, according to the 2024 ESG report Managing Non-Human Identities.
Enterprises that have experienced a compromised NHI averaged 2.7 separate incidents in the past 12 months, according to the same research.
That pattern reinforces the need to read AI data governance through the lens of identity control, not just storage control. The State of Secrets in AppSec report adds that remediating a leaked secret takes 27 days on average.

What this signals

Toxic risk combinations: AI programmes now fail most often at the point where individually acceptable data sources become unsafe when combined inside retrieval or training flows. That means security teams should watch for composition risk, not just presence of sensitive data, and align their controls to the AI lifecycle rather than the storage layer alone.

With more than 55% of organisations already deploying or piloting generative AI tools, the governance problem is no longer experimental. The programme signal is clear: if AI access reviews, data lineage, and remediation evidence are still separate workstreams, the control environment will lag the workload.

Identity teams should expect AI data governance to converge with NHI governance because the same unmanaged access paths move data, prompts, and outputs. The practical test is whether a team can show who can feed regulated content into AI systems, where it goes, and what happens when it should not have been there.

For practitioners

Map AI data flows end to end: Inventory every place sensitive data can enter AI systems, including RAG corpora, model registries, prompt logs, fine-tuning datasets, and third-party copilots. Keep the map current as new tools and pipelines appear, because AI workloads change too quickly for periodic discovery alone.
Classify sensitive data before model consumption: Block PII, PHI, and proprietary code from entering training or fine-tuning until classification and policy checks complete. The useful control point is before embeddings or model weights make the exposure durable.
Tie lineage to response scope: Build lineage records that show which datasets, training runs, and inference paths consumed regulated content. Use those records to scope containment precisely when an AI data issue appears, instead of defaulting to broad retraining.
Review AI access with identity governance teams: Treat access to AI pipelines and inference endpoints as an identity governance problem, not just a data owner decision. Review which human, NHI, and service identities can move regulated content into or out of AI workflows.
Automate evidence for AI compliance controls: Generate continuous audit trails for classification, access decisions, remediation, and blocked prompt-response pairs. That evidence should be usable for EU AI Act and NIST AI RMF reporting without a separate manual evidence chase.

Key takeaways

Traditional DSPM is not enough for AI because embeddings, prompt logs, RAG corpora, and model weights change the exposure model.
AI data risk is amplified by lineage, composition, and shadow AI, which means governance has to start before training and continue through inference.
Practitioners need continuous evidence, granular access control, and lifecycle-based identity governance to manage AI data safely at scale.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

The OWASP Non-Human Identity Top 10 addresses the attack and risk surface, while NIST AI RMF and NIST CSF 2.0 set the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
OWASP Non-Human Identity Top 10	NHI-01	Shadow AI and AI pipeline identities create unmanaged non-human access paths.
NIST AI RMF	GOVERN, MAP, MEASURE functions	AI RMF addresses governance, mapping, and measurement across AI lifecycle risks.
NIST CSF 2.0	PR.AA-05 (PR.AC-4 in CSF 1.1)	Least privilege is central to controlling access to AI data and pipelines.

Inventory AI-connected NHIs and remove unmanaged access paths before data reaches training or inference.

Key terms

Data Security Posture Management For AI: Data security posture management for AI is the discipline of discovering, classifying, and governing sensitive data as it moves through training, fine-tuning, and inference. It extends traditional DSPM into unstructured data, model-adjacent stores, and runtime exposure paths where data can become embedded and difficult to remove.
Shadow AI: Shadow AI is the use of AI tools, copilots, or model workflows outside formal security oversight. It matters because unmanaged AI access often creates unreviewed data movement, making sensitive information easier to expose through prompts, retrieval systems, or fine-tuning pipelines.
Data Lineage: Data lineage is the record of how information moves and changes across systems, from ingestion through transformation to output. In AI environments, lineage helps teams identify which models consumed sensitive data, where risky combinations were created, and what needs to be contained if a problem appears.
Toxic Risk Combination: A toxic risk combination is a set of data sources that seem acceptable individually but become unsafe when combined in an AI workflow. The risk emerges through re-identification, memorisation, or unintended disclosure, which means governance must consider composition as well as content.

Deepen your knowledge

AI data security posture management is covered in our NHI Foundation Level course, the industry's only accredited NHI security programme. If you are building controls for AI pipelines, copilots, and service identities at the same time, it is a relevant starting point.

This post draws on content published by Orca Security: The AI Data Security Crisis and DSPM for AI. Read the original.

NHIMG Editorial Note
Published by the NHIMG editorial team on 2026-05-29.
NHI Mgmt Group — the independent authority on Non-Human Identity, IAM, and Agentic AI security. nhimg.org