AI systems move sensitive data through training, retrieval, and prompt workflows that cross multiple identity boundaries. That means the risk is not only where the data is stored, but which identities can ingest, transform, and re-expose it. Data control therefore has to include identity governance, not just classification.
Why This Matters for Security Teams
AI systems make DSPM harder because the control plane no longer matches the data plane. Sensitive data can enter a model through training corpora, retrieval pipelines, logs, prompts, embeddings, and tool outputs, then reappear through a different identity than the one that ingested it. That breaks a storage-only view of data security and forces teams to govern who can move, transform, and disclose data at runtime.
This is why current guidance increasingly treats DSPM as part of identity and workload governance rather than a standalone classification exercise. The NIST Cybersecurity Framework 2.0 emphasizes risk ownership across processes, while NHIMG research on the State of Secrets in AppSec shows how quickly sensitive material becomes operationally hard to contain once it spreads across fragmented tooling and developer workflows. The challenge is amplified when AI systems can retain patterns, reproduce secrets, or expose them through downstream interactions.
In practice, many security teams find DSPM gaps reactively, after a model or agent has already traversed multiple data sources and identities, rather than anticipating them at design time.
How It Works in Practice
Operationally, AI-aware DSPM needs to track data movement across every stage where an AI system can touch it. That includes ingestion for training, chunking and indexing for retrieval, prompt assembly, model inference, and post-processing in downstream applications. The relevant question is not just “where is the data stored?” but “which identity is allowed to access it, under what context, and for what purpose?”
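One way to picture this kind of stage-by-stage tracking is a small ledger that records which identity touched which dataset, at which stage, and for what purpose. The following is a minimal sketch, not a reference implementation: the names `Stage`, `FlowEvent`, and `FlowLedger`, and the dataset and identity strings, are all hypothetical and chosen for illustration. It stores a SHA-256 fingerprint of the content rather than the content itself.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    """Pipeline stages named in the text above (hypothetical enum)."""
    INGESTION = "ingestion"
    INDEXING = "indexing"
    PROMPT_ASSEMBLY = "prompt_assembly"
    INFERENCE = "inference"
    POST_PROCESSING = "post_processing"


@dataclass(frozen=True)
class FlowEvent:
    dataset: str      # logical name of the sensitive dataset
    stage: Stage      # where in the AI pipeline the data was touched
    identity: str     # workload or service identity that touched it
    purpose: str      # declared purpose of the access
    fingerprint: str  # SHA-256 of the content, kept instead of the content


def fingerprint(content: str) -> str:
    """Return a hex SHA-256 digest so events can be correlated without storing raw data."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


class FlowLedger:
    """Append-only, in-memory record of data movement (illustrative only)."""

    def __init__(self) -> None:
        self._events: list[FlowEvent] = []

    def record(self, dataset: str, stage: Stage, identity: str,
               purpose: str, content: str) -> FlowEvent:
        event = FlowEvent(dataset, stage, identity, purpose, fingerprint(content))
        self._events.append(event)
        return event

    def exposure_path(self, dataset: str) -> list[tuple[str, str]]:
        """Ordered (stage, identity) pairs for every recorded touch of a dataset."""
        return [(e.stage.value, e.identity) for e in self._events if e.dataset == dataset]

    def identities_for(self, dataset: str) -> set[str]:
        """All identities that have touched a dataset at any stage."""
        return {e.identity for e in self._events if e.dataset == dataset}


if __name__ == "__main__":
    ledger = FlowLedger()
    ledger.record("hr-records", Stage.INGESTION, "svc-etl", "training", "example row")
    ledger.record("hr-records", Stage.INFERENCE, "agent-helpdesk", "answering", "example row")
    print(ledger.exposure_path("hr-records"))
```

A ledger like this makes it visible when the identity that re-exposes data at inference time differs from the one that ingested it. A real deployment would need durable, tamper-resistant storage, which this sketch does not attempt.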
Practitioners usually need to combine DSPM with workload identity, policy enforcement, and secrets governance. In practice, this means:
- Mapping sensitive datasets to the identities of models, agents, service accounts, and retrieval services that can access them.
- Using short-lived credentials and tightly scoped tokens so AI workloads do not inherit broad standing access.
- Evaluating policy at request time, because static allowlists do not reflect changing prompts, tools, or retrieval context.
- Logging prompt, retrieval, and tool activity with enough fidelity to reconstruct data exposure paths without storing more sensitive content than necessary.
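The practices above can be sketched together in a few lines. This is an illustrative example under stated assumptions, not a production design: the `DATASET_ACCESS` mapping, the `ScopedToken` and `Decision` types, and the identity, dataset, and purpose names are all hypothetical. It shows a dataset-to-identity map, a short-lived token scoped to specific datasets, a policy check evaluated on every request, and an audit record that carries metadata only.

```python
from __future__ import annotations

import time
from dataclasses import dataclass

# Hypothetical map: dataset -> identity -> purposes that identity may use it for.
DATASET_ACCESS = {
    "customer-pii": {"retrieval-svc": {"support_answer"}},
    "source-code": {"code-agent": {"code_review"}, "retrieval-svc": {"code_search"}},
}


@dataclass(frozen=True)
class ScopedToken:
    identity: str
    datasets: frozenset[str]  # the only datasets this token may reach
    expires_at: float         # epoch seconds


@dataclass(frozen=True)
class Decision:
    allowed: bool
    reason: str


def issue_token(identity: str, datasets: set[str], ttl_seconds: int = 300,
                now: float | None = None) -> ScopedToken:
    """Issue a short-lived token limited to the named datasets."""
    issued = time.time() if now is None else now
    return ScopedToken(identity, frozenset(datasets), issued + ttl_seconds)


def authorize(token: ScopedToken, dataset: str, purpose: str,
              now: float | None = None) -> Decision:
    """Evaluate policy at request time instead of trusting a static allowlist."""
    current = time.time() if now is None else now
    if current >= token.expires_at:
        return Decision(False, "token expired")
    if dataset not in token.datasets:
        return Decision(False, "dataset outside token scope")
    permitted = DATASET_ACCESS.get(dataset, {}).get(token.identity, set())
    if purpose not in permitted:
        return Decision(False, "purpose not permitted for this identity")
    return Decision(True, "ok")


def audit_record(token: ScopedToken, dataset: str, purpose: str,
                 decision: Decision) -> dict:
    """Log who asked for what and why, without copying prompt or document text."""
    return {
        "identity": token.identity,
        "dataset": dataset,
        "purpose": purpose,
        "allowed": decision.allowed,
        "reason": decision.reason,
    }


if __name__ == "__main__":
    token = issue_token("retrieval-svc", {"customer-pii"})
    decision = authorize(token, "customer-pii", "support_answer")
    print(audit_record(token, "customer-pii", "support_answer", decision))
```

Passing `now` explicitly keeps the expiry logic testable. In practice the token would be issued and verified by an identity provider or workload identity system rather than constructed in application code.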
That operational model aligns with the way NHIMG describes modern secret exposure and AI risk in LLMjacking: How Attackers Hijack AI Using Compromised NHIs, where credential misuse becomes the bridge between data access and system abuse. It also explains why NIST Cybersecurity Framework 2.0 is useful but incomplete on its own: DSPM for AI needs identity-aware controls, not just asset inventory and data labels.
These controls tend to break down in environments where retrieval is federated across many SaaS tools and model endpoints because provenance, permissions, and prompt context are fragmented.
Common Variations and Edge Cases
Tighter AI data controls often increase friction, requiring organisations to balance exposure reduction against model utility, latency, and developer velocity. That tradeoff is especially visible when teams try to apply the same DSPM rules to training data, production prompts, and agent tool calls, even though those flows carry different risk profiles.
Best practice is still evolving, and no universal standard yet exists for AI-aware DSPM. Some teams treat embeddings as low-risk derived data, but that can be misleading if the embedding store becomes a searchable proxy for sensitive source material. Others focus only on prompt filtering, which misses exposure through retrieval augmentation, fine-tuning artifacts, cached completions, and third-party connectors. Current guidance suggests treating every AI data boundary as an identity boundary as well.
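To illustrate the embedding concern, here is a minimal sketch in which each embedding carries the sensitivity label of its source document and retrieval filters on the caller's clearance before ranking. Everything here is an assumption made for illustration: the three-level `SENSITIVITY_RANK` scale, the `EmbeddingRecord` type, and the `search` function are hypothetical and do not describe any particular vector store.

```python
from __future__ import annotations

import math
from dataclasses import dataclass

# Hypothetical ordering of sensitivity labels, lowest to highest.
SENSITIVITY_RANK = {"public": 0, "internal": 1, "restricted": 2}


@dataclass(frozen=True)
class EmbeddingRecord:
    vector: tuple[float, ...]
    source_id: str
    sensitivity: str  # copied from the source document at index time


def cosine(a: tuple[float, ...], b: tuple[float, ...]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(store: list[EmbeddingRecord], query: tuple[float, ...],
           caller_clearance: str, top_k: int = 3) -> list[EmbeddingRecord]:
    """Drop records above the caller's clearance, then rank the rest by similarity."""
    limit = SENSITIVITY_RANK[caller_clearance]
    visible = [r for r in store if SENSITIVITY_RANK[r.sensitivity] <= limit]
    ranked = sorted(visible, key=lambda r: cosine(r.vector, query), reverse=True)
    return ranked[:top_k]


if __name__ == "__main__":
    store = [
        EmbeddingRecord((1.0, 0.0), "salary-table", "restricted"),
        EmbeddingRecord((0.9, 0.1), "press-release", "public"),
    ]
    for record in search(store, (1.0, 0.0), caller_clearance="public"):
        print(record.source_id)
```

Filtering before ranking matters in this sketch: a restricted record never competes for a slot in the results, so its presence cannot be inferred from what a lower-clearance caller gets back.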
NHIMG’s DeepSeek breach coverage is a useful reminder that AI-specific data exposure is not theoretical. The operational lesson is straightforward: if an AI system can ingest data, it can also re-expose it through a different path, so DSPM must follow the identity that accessed the data, not just the place where the data was stored.
Standards & Framework Alignment
This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.
The OWASP Non-Human Identity Top 10 addresses the attack and risk surface, while NIST CSF 2.0 and NIST AI RMF set the governance and control requirements practitioners need to meet.
| Framework | Control / Reference | Relevance |
|---|---|---|
| NIST CSF 2.0 | PR.AA-05 (PR.AC-4 in CSF 1.1) | AI data exposure depends on identity and access context, not storage alone. |
| NIST AI RMF | (none cited) | Addresses governance for data risk across model lifecycles and use contexts. |
| OWASP Non-Human Identity Top 10 | NHI-03 | Secret exposure through AI workflows depends on weak credential lifecycle control. |
Map AI data flows to access controls and enforce least privilege at each retrieval and tool boundary.
Reviewed and updated by the NHIMG editorial team on June 7, 2026.
NHI Mgmt Group — the #1 independent authority on Non-Human Identity, IAM, and Agentic AI security. nhimg.org