AI security depends on data visibility in DSPM and governance

By NHI Mgmt Group Editorial Team | Published 2025-06-19 | Domain: Best Practices | Source: Cyera

TL;DR: AI applications are expanding enterprise data exposure by pulling in emails, chat logs, legal documents, and cloud files, while legacy classification and access controls struggle to keep up, according to Cyera. The governance problem is now less about storing data than knowing what AI can reach, classify, and expose before that reach becomes a breach.

At a glance

What this is: Cyera’s analysis of how AI and machine learning are changing Data Security Posture Management (DSPM). Its core finding is that data visibility and access governance now determine AI security outcomes.

Why it matters: IAM, NHI, and autonomous governance programmes all fail when teams cannot see what data an AI system can access, classify, or expose.

By the numbers:

The total amount of data worldwide is expected to reach 181 zettabytes in 2025.

👉 Read Cyera’s research on how AI and ML are changing DSPM

Context

AI security is becoming a data visibility problem before it becomes a model problem. As generative systems ingest more corporate content, security teams lose practical control over what information is discoverable, classifiable, and exposed across cloud, SaaS, and internal environments. That creates a governance gap for both AI-adjacent non-human identities and the human users who feed those systems.

DSPM exists to close that gap by discovering data, classifying it by sensitivity, and enforcing access controls against what the system actually sees rather than what the organisation assumes it has protected. In this article, the point is not that AI is inherently unsafe. The point is that AI expands the blast radius of weak data governance, especially when unstructured content and shadow AI live outside the visibility of legacy controls.

Key questions

Q: How should security teams govern AI systems that can access sensitive corporate data?

A: Security teams should govern AI systems as non-human identities with tightly scoped access, continuous discovery, and file-level classification of the data they can reach. The priority is to know what the system can see, then reduce that exposure to the minimum required for the use case. Without that context, AI security becomes guesswork rather than control.

Q: Why do generative AI tools increase data security risk?

A: Generative AI tools increase risk because they expand the number of places where sensitive content can be ingested, copied, surfaced, or misused. They also consume unstructured data that legacy classification tools often misread, which weakens policy enforcement. The result is a larger blast radius when access is over-permissioned or data visibility is incomplete.

Q: What breaks when organisations rely on manual data classification for AI security?

A: Manual classification breaks when the data set is too large, too diverse, or too unstructured for human review to stay accurate. Regex rules and hand-applied labels miss context, produce false positives, and leave sensitive material unclassified. In AI environments, that means access controls are built on an incomplete view of the information surface.

Q: How do you know if AI data governance is actually working?

A: AI data governance is working when discovery is continuous, classification confidence is high, and access to sensitive data is consistently limited to the smallest necessary set of identities and applications. Teams should also be able to show where regulated content resides, how it is protected, and which AI tools can reach it.

Technical breakdown

Why AI data discovery now has to span cloud, SaaS, and shadow AI

AI and machine learning tools are only as governable as the data surfaces they can reach. Discovery in a DSPM context means continuously inventorying data across on-premise systems, cloud stores, SaaS applications, and AI tools, then updating that inventory as content moves or changes. The hard part is not locating files once. It is maintaining an accurate view of where sensitive data lives when new AI applications, shared repositories, and copied datasets appear faster than manual review cycles can track.

Practical implication: build continuous discovery coverage across every environment where AI can read or copy sensitive content.
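As a rough illustration of what "continuous" means in practice, the sketch below compares two discovery scans and reports what appeared, what disappeared, and which environments the latest scan missed entirely. The `DataStore` schema, the environment labels, and the sample locations are all hypothetical; a real DSPM product would supply its own inventory format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataStore:
    """One discovered location an AI tool could read (hypothetical schema)."""
    location: str      # e.g. a bucket, SaaS workspace, or file share
    environment: str   # "cloud", "saas", "on-premise", or "ai-tool"


def diff_inventory(previous: set[DataStore], current: set[DataStore]) -> dict[str, list[DataStore]]:
    """Compare two discovery scans and report what appeared or disappeared."""
    return {
        "new": sorted(current - previous, key=lambda s: s.location),
        "gone": sorted(previous - current, key=lambda s: s.location),
    }


def uncovered_environments(current: set[DataStore], required: set[str]) -> set[str]:
    """Return the required environments that the latest scan did not cover at all."""
    return required - {store.environment for store in current}


# Illustrative data only: a copied dataset shows up inside an AI tool's index.
last_scan = {
    DataStore("s3://finance-reports", "cloud"),
    DataStore("sharepoint/legal", "saas"),
}
this_scan = last_scan | {DataStore("copilot-index/legal-copy", "ai-tool")}

changes = diff_inventory(last_scan, this_scan)
missing = uncovered_environments(this_scan, {"cloud", "saas", "on-premise", "ai-tool"})
```

Here `changes["new"]` surfaces the copied dataset that did not exist at the previous scan, and `missing` shows that on-premise stores were never scanned, which is a coverage gap rather than evidence that no data lives there.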

How unstructured data classification changes the AI security model

Generative AI has shifted the security burden toward unstructured data such as text, images, audio, and video. Traditional classification tools built on manual tagging or rigid regular expressions often miss context and generate false positives, which makes them unreliable at AI scale. DSPM tools use automated pattern recognition and contextual analysis to identify sensitive content at file level, then map that content back to risk and policy. That matters because AI does not need structured records to create exposure. It only needs accessible content with enough context to reproduce or disclose it.

Practical implication: replace manual or regex-only classification with automated file-level classification for AI-facing data stores.
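To make the false-positive problem concrete, the sketch below contrasts a regex-only check for card numbers with a check that also validates the Luhn checksum and looks for nearby payment vocabulary. This is a deliberately small stand-in for contextual analysis, not a description of how Cyera or any DSPM product classifies data; the keyword list and sample strings are assumptions chosen for illustration.

```python
import re

# Matches 16 digits, optionally separated by spaces or hyphens.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){15}\d\b")
# Hypothetical context vocabulary; a real classifier would use far richer signals.
CONTEXT_WORDS = ("card", "visa", "mastercard", "payment", "cvv")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:      # double every second digit from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def regex_only(text: str) -> bool:
    """Flag any text containing something shaped like a 16-digit number."""
    return CARD_PATTERN.search(text) is not None


def context_aware(text: str) -> bool:
    """Flag only when a match passes Luhn and payment vocabulary is present."""
    lowered = text.lower()
    has_context = any(word in lowered for word in CONTEXT_WORDS)
    for match in CARD_PATTERN.finditer(text):
        digits = re.sub(r"\D", "", match.group())
        if has_context and luhn_valid(digits):
            return True
    return False


tracking_note = "Tracking ID 1234 5678 9012 3456 for shipment"
payment_note = "Customer paid with Visa card 4111 1111 1111 1111"
```

Both notes trip `regex_only`, but only `payment_note` passes `context_aware`: the tracking ID fails the checksum and has no payment vocabulary around it. The same idea, applied with much stronger models, is what lets file-level classification stay usable at AI scale.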

What access governance means when AI tools are non-human identities

When AI tools can read corporate data, they function as non-human identities that need scoped access and visibility into entitlement boundaries. The security issue is not simply that the tool is powerful, but that over-privileged access lets it ingest or surface information beyond its intended use. DSPM adds context around what those identities can reach, which makes it possible to reduce privilege and constrain the blast radius if an AI workflow is misused or behaves unexpectedly.

Practical implication: treat AI tools as governed identities and verify their access against the sensitivity of the data they can reach.
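One way to picture that verification step is a check that compares each grant held by an AI tool against an approved sensitivity ceiling for that identity. The sketch below assumes a four-level sensitivity scale, a `Grant` record, and a per-identity ceiling map; all of these names and the sample identities are hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum


class Sensitivity(IntEnum):
    """Hypothetical four-level scale; higher means more sensitive."""
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


@dataclass(frozen=True)
class Grant:
    identity: str            # the AI tool or other non-human identity
    resource: str            # the data store it can read
    sensitivity: Sensitivity  # highest classification found in that store


def over_scoped(grants: list[Grant], approved_ceiling: dict[str, Sensitivity]) -> list[Grant]:
    """Return grants whose data sensitivity exceeds the identity's approved ceiling.

    Identities with no approved ceiling are treated as PUBLIC-only, so an
    uninventoried tool with any non-public access is flagged.
    """
    return [
        grant for grant in grants
        if grant.sensitivity > approved_ceiling.get(grant.identity, Sensitivity.PUBLIC)
    ]


grants = [
    Grant("support-chatbot", "kb/articles", Sensitivity.PUBLIC),
    Grant("support-chatbot", "crm/tickets", Sensitivity.INTERNAL),
    Grant("support-chatbot", "hr/payroll", Sensitivity.RESTRICTED),
    Grant("unregistered-copilot", "crm/tickets", Sensitivity.INTERNAL),
]
ceilings = {"support-chatbot": Sensitivity.INTERNAL}

flagged = over_scoped(grants, ceilings)
```

With this sample data, `flagged` contains the chatbot's payroll grant and the unregistered copilot's ticket access. Defaulting unknown identities to the lowest ceiling is a design choice: it makes shadow AI show up as a finding instead of passing silently.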

Snowflake breach: compromised Ticketmaster, Santander, and others via cloud credential abuse.
Salesloft OAuth token breach: hackers stole OAuth tokens to access Salesforce data via Salesloft.

Read our 52 NHI Breaches Analysis report for a comprehensive view of breaches impacting Non-Human Identities including AI Agents.

NHI Mgmt Group analysis

AI security has become a data governance problem, not a model-only problem. The article is correct to frame AI risk through data visibility, because the first failure is often not prompt abuse but uncontrolled access to content that should never have been reachable by the system. That shifts the governance centre of gravity toward discovery, classification, and entitlement context. Practitioners should treat AI security as a control-plane issue for data access, not as a separate AI-only domain.

Unstructured data is the new weak point in enterprise classification. Manual labels and regex-driven classifiers were designed for bounded, structured data sets, not for the volume and ambiguity of modern AI inputs. Once email, chat, document stores, and media repositories become training or inference inputs, classification quality becomes the deciding factor in whether policy is enforceable at all. The implication is that classification fidelity now defines the practical limits of AI governance.

Blast radius, not just exposure count, is the decisive metric for AI-facing NHI governance. Cyera’s article points to non-human AI tools as identities that can be over-privileged, which means the core governance question is how much data a machine can reach if its scope is wrong. That is a shared problem across NHI and autonomous control programmes, because access scope determines whether one misconfiguration becomes a contained event or a broad disclosure. Practitioners should measure who can reach what before they measure how often the data is scanned.
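The difference between exposure count and blast radius can be shown with a small calculation. The sketch below assumes a map from each identity to the stores it can reach and a count of sensitive files per store; the identities, store names, and file counts are invented for illustration.

```python
def blast_radius(reachable: dict[str, set[str]], sensitive_files: dict[str, int]) -> dict[str, int]:
    """Sum the sensitive files each identity could reach across all its stores."""
    return {
        identity: sum(sensitive_files.get(store, 0) for store in stores)
        for identity, stores in reachable.items()
    }


# Hypothetical figures: sensitive file counts per data store.
sensitive_files = {
    "wiki": 0,
    "marketing-assets": 0,
    "shared-drive": 5,
    "legal-archive": 12000,
}
# Hypothetical entitlements: which stores each AI identity can read.
reachable = {
    "docs-summariser": {"wiki", "marketing-assets", "shared-drive"},
    "contract-assistant": {"legal-archive"},
}

grant_counts = {identity: len(stores) for identity, stores in reachable.items()}
radius = blast_radius(reachable, sensitive_files)
ranked = sorted(radius, key=radius.get, reverse=True)
```

By grant count, `docs-summariser` looks riskier with three stores against one. By blast radius, `contract-assistant` ranks first because its single grant reaches 12,000 sensitive files, which is the ordering a scope review should follow.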

Shadow AI creates governance debt because the organisation cannot govern what it cannot inventory. The article’s reference to hidden AI applications is a reminder that discovery is the prerequisite control for every downstream policy decision. If the team cannot see the tool, it cannot classify the data, constrain the access, or prove compliance. The practical conclusion is straightforward: inventory gaps are policy gaps in disguise.

AI governance and privacy compliance are converging around the same control set. Access controls, encryption, tokenization, logging, and residency decisions now operate as a single assurance surface when AI systems can ingest regulated information. This is where data governance, IAM, and compliance cease to be separate workstreams. Practitioners should align AI controls to the sensitivity of the data, not to the novelty of the workload.

From our research:
The average organisation believes more than 1 in 5 of its non-human identities are insufficiently secured, according to The 2024 ESG Report: Managing Non-Human Identities.
In the same report, 72% of organisations said they have experienced or suspect they have experienced a breach of non-human identities.
That is why the Ultimate Guide to NHIs, Key Challenges and Risks is the right next step for teams trying to reduce exposure across machine identities.

What this signals

AI governance will increasingly be judged by how fast teams can correlate discovery, classification, and access context. The practical standard is shifting from periodic review to continuous visibility, because AI systems consume data continuously and mutate the risk surface as they do so. In our view, teams that still separate data security, IAM, and AI oversight will miss the point of the control model.

The growth of shadow AI means inventory discipline now matters as much as policy design. If the organisation cannot see an AI tool, it cannot classify the data it touches or explain the entitlement path that made the exposure possible. That is a programme maturity issue, not just a tooling issue.

For practitioners

Inventory AI-readable data paths: Map where generative AI tools, copilots, and machine learning pipelines can reach corporate content across cloud, SaaS, and on-premise stores. Include shadow AI and shared repositories so discovery reflects the real data surface, not just approved systems.
Replace manual classification for AI-facing datasets: Use automated classification for unstructured data at file level, especially in email, chat, document, image, and media repositories that feed AI models. Manual tags and regex rules will not keep pace with the scale or ambiguity of AI consumption.
Scope non-human access by data sensitivity: Review which AI tools and other non-human identities can access sensitive information, then reduce permissions to the minimum data set needed for the workflow. Tie access decisions to the confidentiality of the content rather than to the convenience of the application team.
Correlate access, classification, and compliance signals: Bring data access, classification confidence, residency, retention, encryption, and logging into one operational view so teams can act on exposure before it becomes a reportable event. This is the control set that determines whether AI use is governable or merely tolerated.

Key takeaways

AI expands data risk by increasing what systems can see, not just what they can compute.
Legacy classification and access models are too brittle for unstructured data and shadow AI.
Practitioners should treat AI-facing tools as governed non-human identities and reduce their data reach first.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

The OWASP Non-Human Identity Top 10 addresses the attack and risk surface, while NIST CSF 2.0 and NIST Zero Trust (SP 800-207) set the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
OWASP Non-Human Identity Top 10	NHI-03	AI tools with broad data reach create overexposed non-human identity risk.
NIST CSF 2.0	PR.DS	The article centres on protecting data across AI ingestion and access paths.
NIST Zero Trust (SP 800-207)	PR.AC-4 (a NIST CSF 1.1 subcategory identifier)	Zero trust access limits matter when AI tools can reach sensitive corporate content.

Apply data protection controls to AI-readable content and verify they work across all repositories.

Key terms

Data Security Posture Management: Data Security Posture Management is the practice of continuously discovering, classifying, and controlling sensitive data across environments. In AI-heavy environments, it becomes the control layer that shows what data exists, where it lives, and which systems or identities can reach it.
Unstructured Data: Unstructured data is information that does not fit neatly into fixed database fields, such as documents, chat logs, emails, images, audio, and video. It is harder to classify accurately, which makes it a common source of blind spots when AI systems ingest corporate content.
Shadow AI: Shadow AI is the use of AI applications or models that security teams have not formally discovered, approved, or governed. It creates an inventory gap that undermines classification, access control, and compliance because the organisation cannot protect what it cannot see.
Non-Human Identity: A non-human identity is any machine or software identity that can authenticate and access resources, including AI tools, service accounts, tokens, and APIs. In practice, these identities must be governed by the same discipline as human access, with tighter scope because machines can scale mistakes faster.

Deepen your knowledge

AI data governance and non-human access control are core topics in our NHI Foundation Level course, the industry's only accredited NHI security programme. If you are building governance for AI-facing systems and machine identities, it is worth exploring.

This post draws on content published by Cyera: The Role of AI and ML in DSPM. Read the original.

NHIMG Editorial Note
Published by the NHIMG editorial team on 2025-06-19.
NHI Mgmt Group — the independent authority on Non-Human Identity, IAM, and Agentic AI security. nhimg.org