How should security teams govern unstructured data for GenAI use cases?

Why This Matters for Security Teams

Unstructured data is where GenAI risk becomes practical, not theoretical. Documents, chats, ticket exports, code snippets, presentations, and support transcripts often contain sensitive context that is invisible to simple labels but highly valuable to an LLM. The governance problem is therefore not just “can the model read it,” but “should this content influence outputs, memory, retrieval, or downstream actions.” That is why security teams need business context, ownership, and use-path control, not a classification program that stops at file names.

Current guidance from the NIST Cybersecurity Framework 2.0 and the NIST AI 600-1 GenAI Profile points toward risk-based governance, but unstructured data still defeats many governance programmes because the context that makes it sensitive lives outside the document itself. NHIMG research on the State of Secrets in AppSec shows why this matters: 43% of security professionals are concerned that AI systems will learn and reproduce sensitive patterns from codebases. In practice, many security teams discover unsafe content pathways only after an assistant has already indexed, summarized, or exposed data that business owners never expected to be reusable.

How It Works in Practice

Effective governance starts by mapping unstructured content to business meaning, not just sensitivity labels. A file may be “internal,” but that says nothing about whether it contains customer data, source code, pricing strategy, or incident notes that should be excluded from GenAI retrieval. Security teams should combine DSPM with business ownership so each data domain has an accountable owner who can decide whether the content is permissible for training, retrieval-augmented generation, summarization, or agentic tool use.
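The mapping described above can be sketched as a small registry that attaches an accountable owner and a set of permitted GenAI uses to each data domain. This is a minimal illustration, not a reference implementation: the `DataDomain` and `Use` names, the example domain, and the owner address are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Use(Enum):
    """GenAI use paths named in the text above."""
    TRAINING = "training"
    RETRIEVAL = "retrieval"
    SUMMARIZATION = "summarization"
    AGENT_TOOL_USE = "agent_tool_use"


@dataclass(frozen=True)
class DataDomain:
    """A data domain with business context (hypothetical schema)."""
    name: str
    owner: str               # accountable business owner
    sensitivity_label: str   # e.g. "internal"; not sufficient on its own
    permitted_uses: frozenset = frozenset()


def is_permitted(domain: DataDomain, use: Use) -> bool:
    """A use is allowed only if the owner has explicitly permitted it."""
    return use in domain.permitted_uses


# Example: an "internal" domain that may be summarized but not retrieved.
incident_notes = DataDomain(
    name="incident-notes",
    owner="secops-lead@example.com",
    sensitivity_label="internal",
    permitted_uses=frozenset({Use.SUMMARIZATION}),
)

print(is_permitted(incident_notes, Use.SUMMARIZATION))
print(is_permitted(incident_notes, Use.RETRIEVAL))
```

The point of the design is that the sensitivity label plays no part in the decision: two domains labelled "internal" can carry different permitted uses because their owners decided differently.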

A practical operating model usually includes three controls:

Discovery and enrichment: classify repositories, chats, and document stores, then enrich them with owner, system, and business function metadata.

Access and use policy: define which content can be read, chunked, embedded, cached, or used as model memory, and tie those decisions to human review where the risk is high.

Data path enforcement: apply policy at ingestion and at query time so an LLM cannot bypass controls simply because a user had indirect access to the source.
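As a rough sketch of how the three controls fit together, the example below checks enrichment metadata and use policy at ingestion, then re-applies policy at query time. The `Document` fields, operation names such as `"embed"` and `"read"`, and the group model are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Document:
    """A document plus enrichment and policy metadata (hypothetical schema)."""
    doc_id: str
    owner: Optional[str]              # discovery and enrichment
    business_function: Optional[str]  # discovery and enrichment
    allowed_operations: frozenset     # access and use policy, e.g. {"read", "embed"}
    allowed_groups: frozenset         # groups with direct access to the source


def can_ingest(doc: Document) -> bool:
    """Ingestion gate: require enrichment metadata and an explicit 'embed' grant."""
    has_metadata = bool(doc.owner) and bool(doc.business_function)
    return has_metadata and "embed" in doc.allowed_operations


def filter_at_query_time(docs, user_groups: frozenset) -> list:
    """Query-time gate: re-check policy against the requesting user's groups."""
    return [
        d for d in docs
        if "read" in d.allowed_operations and d.allowed_groups & user_groups
    ]


runbook = Document("d1", "ops@example.com", "operations",
                   frozenset({"read"}), frozenset({"sre"}))
faq = Document("d2", "support@example.com", "support",
               frozenset({"read", "embed"}), frozenset({"support", "sales"}))
orphan = Document("d3", None, None,
                  frozenset({"read", "embed"}), frozenset({"sales"}))

index = [d for d in (runbook, faq, orphan) if can_ingest(d)]
visible = filter_at_query_time(index, frozenset({"sales"}))
print([d.doc_id for d in visible])
```

Checking at both points is deliberate: the ingestion gate keeps unowned or unapproved content out of the index, while the query-time gate keeps an indexed document from reaching a user who has no direct access to its source.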

This is especially important for secrets, credentials, and operational runbooks. NHIMG’s Top 10 NHI Issues and the Ultimate Guide to NHIs — Lifecycle Processes for Managing NHIs both reinforce that identity and lifecycle controls matter when data becomes machine-consumable. If a knowledge base includes API keys, deployment instructions, or recovery steps, that content should be treated as a high-risk input channel, not a normal search index. The goal is to make policy decisions before the model sees the content, then re-check them whenever the content is retrieved, summarized, or attached to an agent action.
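One way to act on this before ingestion is a pre-indexing scan that flags content carrying secret-like strings. The sketch below uses a few illustrative regular expressions; the pattern set is an assumption and is far from exhaustive, so a real deployment would more likely rely on a dedicated secret scanner.

```python
import re

# Illustrative patterns only; the names and coverage are assumptions.
SECRET_PATTERNS = {
    # AWS access key IDs commonly start with "AKIA" followed by 16 characters.
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # PEM-encoded private key header.
    "private_key_block": re.compile(
        r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"
    ),
    # Generic "name = value" assignments for credential-like names.
    "credential_assignment": re.compile(
        r"\b(?:api[_-]?key|secret|token|password)\b\s*[:=]\s*\S{8,}",
        re.IGNORECASE,
    ),
}


def find_secret_indicators(text: str) -> list:
    """Return the names of every pattern that matches the text."""
    return [name for name, pattern in SECRET_PATTERNS.items()
            if pattern.search(text)]


def safe_to_ingest(text: str) -> bool:
    """Block ingestion when any secret indicator is present."""
    return not find_secret_indicators(text)


runbook = "Rotate the key AKIAIOSFODNN7EXAMPLE, then restart the service."
print(find_secret_indicators(runbook))
print(safe_to_ingest("Restart the service and check the dashboard."))
```

A match here is a reason to quarantine the document for owner review rather than proof of a live credential, since pattern-based scanning produces both false positives and false negatives.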

These controls tend to break down in sprawling collaboration environments where content is copied across SaaS tools faster than ownership and policy metadata can be maintained.

Common Variations and Edge Cases

Tighter governance often increases operational friction, requiring organisations to balance model usefulness against review overhead and slower content onboarding. That tradeoff is real, especially when teams want broad enterprise search or copilots across hundreds of repositories.

Best practice is evolving for mixed-trust environments. Some organisations allow low-risk internal content into GenAI systems by default and gate only sensitive domains; others require explicit approval for every source system. There is no universal standard for this yet. The right answer depends on the data class, the intended use, and whether the model is allowed to persist information beyond the current session.
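The two onboarding approaches described above differ only in what happens to a source nobody has reviewed yet. A minimal sketch, with hypothetical mode names and domain names, makes the difference concrete:

```python
from enum import Enum


class OnboardingMode(Enum):
    DEFAULT_ALLOW = "default_allow"          # gate only sensitive domains
    EXPLICIT_APPROVAL = "explicit_approval"  # every source needs sign-off


def source_allowed(domain: str, mode: OnboardingMode,
                   sensitive_domains: set, approved_domains: set) -> bool:
    """Decide whether a source domain may feed the GenAI system."""
    if mode is OnboardingMode.EXPLICIT_APPROVAL:
        return domain in approved_domains
    # Default-allow: only sensitive domains need an approval on record.
    return domain not in sensitive_domains or domain in approved_domains


sensitive = {"hr-cases", "incident-notes"}
approved = {"incident-notes"}

for mode in OnboardingMode:
    print(mode.value, source_allowed("team-wiki", mode, sensitive, approved))
```

An unreviewed, non-sensitive source such as the hypothetical `team-wiki` is admitted under default-allow and rejected under explicit approval, which is the usefulness-versus-overhead tradeoff in miniature.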

Edge cases usually involve content that looks harmless in isolation but becomes sensitive when combined. Meeting notes may not contain regulated data, but they can reveal deal strategy, incident details, or hidden credentials. Similarly, source code may contain no secrets, yet a model can still infer operational patterns from it that should not be surfaced broadly. NHIMG’s Ultimate Guide to NHIs — Regulatory and Audit Perspectives is useful here because auditors will ask not only what was indexed, but who approved the use path and how exclusions were enforced. A mature programme also tracks exceptions, because unmanaged exceptions quickly become the real policy. Even then, the hardest failures appear when a “safe” document corpus is repurposed for agents with tool access, because retrieval risk turns into action risk almost immediately.
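Exception tracking can be as simple as a register in which every exception names an approver and carries an expiry date, so that none becomes permanent by default. The sketch below assumes a hypothetical `PolicyException` record; the source names, approvers, and dates are made up for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PolicyException:
    """One approved deviation from the use policy (hypothetical schema)."""
    source: str
    approver: str
    reason: str
    expires_on: date


def split_exceptions(exceptions, today: date):
    """Separate exceptions still in force from those due for re-approval."""
    active = [e for e in exceptions if e.expires_on >= today]
    expired = [e for e in exceptions if e.expires_on < today]
    return active, expired


register = [
    PolicyException("sales-wiki", "cro@example.com",
                    "pilot copilot rollout", date(2025, 12, 31)),
    PolicyException("legacy-tickets", "it-lead@example.com",
                    "migration backlog", date(2025, 1, 31)),
]

active, expired = split_exceptions(register, today=date(2025, 6, 1))
print([e.source for e in active])
print([e.source for e in expired])
```

Recording the approver alongside each entry also answers the audit question raised above: who approved the use path, and until when.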

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

The OWASP Non-Human Identity Top 10 addresses the attack and risk surface, while NIST AI RMF and NIST CSF 2.0 set the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
NIST AI RMF		Governance of GenAI data use maps to AI risk management and oversight.
NIST CSF 2.0	PR.DS	Unstructured data governance is a data-security and protection problem.
OWASP Non-Human Identity Top 10	NHI-01	GenAI systems often expose secrets and sensitive content through unmanaged non-human access.

Treat model-facing repositories as NHI attack surfaces and remove secrets before ingestion.
