Unstructured data governance is the bottleneck for GenAI and agents

By NHI Mgmt Group Editorial Team
Published 2025-10-06
Domain: Governance & Risk
Source: Collibra

TL;DR: Most enterprise data remains unstructured, and without enrichment and governance it is hard for GenAI and agentic systems to retrieve, reason with, or trust the content, according to Collibra. The practical issue is not data volume alone but governed context: AI outputs become weaker when metadata, discoverability, and routing are missing.

At a glance

What this is: Collibra argues that most enterprise unstructured data is not yet AI-ready because it lacks the governance and metadata needed for reliable GenAI and agentic use.

Why it matters: For IAM and identity practitioners, the lesson is that AI governance depends on data governance because access, retrieval, and reasoning all break down when content is unstructured and poorly governed.

By the numbers:

IDC estimates that 80–90% of enterprise data is unstructured, yet much of it remains untapped.


Context

GenAI and agentic workflows only become dependable when the systems behind them can find, interpret, and reuse the right content. In practice, most enterprises still rely on contracts, presentations, transcripts, and emails that sit outside clean taxonomy and metadata structures, which leaves AI working with fragmented context rather than governed knowledge.

That matters to identity and access programmes because AI output quality is increasingly tied to what the system is allowed to retrieve, how content is tagged, and whether the data path is governed end to end. The issue is not just search quality. It is whether enterprise AI can operate on controlled, explainable inputs rather than raw repositories full of duplicates and stale material.

Key questions

Q: How should security teams govern unstructured data used by AI systems?

A: Security teams should classify the source content, define metadata standards, and restrict AI pipelines to governed repositories with known ownership and review rules. The objective is not to make every file perfect. It is to ensure that models retrieve from trusted, current, and context-rich sources that can be audited later.

Q: Why do unstructured files create risk for GenAI and agentic workflows?

A: Unstructured files create risk because retrieval systems cannot reliably infer business meaning, ownership, or freshness from raw content alone. That increases the chance of stale inputs, irrelevant retrieval, and hallucinated answers. When agents depend on those outputs, the error is multiplied through automated decision-making.

Q: How can organisations tell whether AI input governance is actually working?

A: Look for lower duplication, better content discoverability, clearer source attribution, and fewer AI outputs that depend on manually corrected context. If teams still spend most of their time hunting for the right file or fixing weak answers, the governance layer is not yet effective.
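One way to make these signals measurable is to compare periodic snapshots. The sketch below is a minimal illustration, assuming a hypothetical `GovernanceSnapshot` record whose counts come from your own catalogue and answer logs; the field names are assumptions, not anything defined by Collibra, and discoverability is left out because it needs its own measurement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GovernanceSnapshot:
    """Hypothetical counts collected for one reporting period."""
    total_documents: int
    duplicate_documents: int
    answers: int
    answers_with_sources: int        # answers that cite at least one source
    answers_manually_corrected: int  # answers a human had to fix

    def rates(self) -> dict[str, float]:
        # max(..., 1) avoids division by zero for empty periods.
        docs = max(self.total_documents, 1)
        answers = max(self.answers, 1)
        return {
            "duplication": self.duplicate_documents / docs,
            "attribution": self.answers_with_sources / answers,
            "manual_correction": self.answers_manually_corrected / answers,
        }


def is_improving(previous: GovernanceSnapshot, current: GovernanceSnapshot) -> bool:
    """True when duplication and corrections fall and attribution rises."""
    before, after = previous.rates(), current.rates()
    return (
        after["duplication"] < before["duplication"]
        and after["attribution"] > before["attribution"]
        and after["manual_correction"] < before["manual_correction"]
    )


previous = GovernanceSnapshot(1000, 300, 200, 80, 90)
current = GovernanceSnapshot(1000, 120, 200, 150, 40)
print(is_improving(previous, current))
```

Requiring all three rates to move in the right direction is a deliberately strict choice; a team could equally track each rate against its own target.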

Q: What should teams prioritise first for AI-ready content?

A: Start with the repositories that feed production AI use cases, then apply enrichment, ownership, and lifecycle controls to those sources before expanding outward. That sequence reduces risk faster than trying to govern every document store at once.
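The sequencing described here can be sketched as a simple ranking. This is an illustrative example only: the use-case inventory, the stage labels, and the repository names are all hypothetical, and a real programme would likely weigh risk and data sensitivity as well.

```python
from collections import defaultdict


def prioritise_repositories(use_cases: list[tuple[str, str, list[str]]]) -> list[str]:
    """Rank repositories by how many production AI use cases they feed.

    Each use case is (name, stage, repositories), where stage is
    "production" or anything else (for example "pilot").
    """
    production_count: dict[str, int] = defaultdict(int)
    all_repos: set[str] = set()
    for _name, stage, repos in use_cases:
        for repo in repos:
            all_repos.add(repo)
            if stage == "production":
                production_count[repo] += 1
    # Most production dependencies first; ties broken alphabetically.
    return sorted(all_repos, key=lambda r: (-production_count[r], r))


inventory = [
    ("support-bot", "production", ["kb", "contracts"]),
    ("sales-agent", "production", ["contracts", "crm-notes"]),
    ("hr-pilot", "pilot", ["hr-share"]),
]
print(prioritise_repositories(inventory))
```

Repositories that feed no production use case fall to the end of the list, which matches the idea of governing outward from production rather than covering every store at once.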

Technical breakdown

Why unstructured data breaks GenAI retrieval

Unstructured data is content without consistent machine-readable context. A document may contain the right facts, but if it lacks semantic tags, business metadata, or clean deduplication, retrieval systems struggle to identify what matters and why. In RAG and agentic workflows, that weakness matters because the model is only as good as the retrieved context it receives. Poor discoverability and outdated content produce brittle answers, lower confidence, and higher hallucination risk.

Practical implication: treat retrieval quality as a governance problem, not a prompt problem.
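As a rough illustration of handling duplicates and stale content before retrieval, the sketch below filters a document set down to fresh, de-duplicated candidates. The `Document` fields, the one-year freshness window, and the exact-match fingerprint are assumptions made for the example; exact hashing only catches copies that match after case and whitespace normalisation, so near-duplicates would need a different technique.

```python
import hashlib
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Document:
    """Hypothetical document record with a last-review date."""
    doc_id: str
    text: str
    last_reviewed: date


def fingerprint(text: str) -> str:
    # Normalise case and whitespace so trivially different copies collide.
    normalised = " ".join(text.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def retrieval_candidates(
    docs: list[Document], today: date, max_age_days: int = 365
) -> list[Document]:
    """Drop stale documents, then keep the newest copy of each duplicate."""
    cutoff = today - timedelta(days=max_age_days)
    newest: dict[str, Document] = {}
    for doc in docs:
        if doc.last_reviewed < cutoff:
            continue  # stale: not reviewed within the freshness window
        key = fingerprint(doc.text)
        kept = newest.get(key)
        if kept is None or doc.last_reviewed > kept.last_reviewed:
            newest[key] = doc
    return sorted(newest.values(), key=lambda d: d.doc_id)


docs = [
    Document("a", "Refund policy: 30 days", date(2025, 1, 10)),
    Document("b", "refund  policy: 30 DAYS", date(2025, 6, 1)),
    Document("c", "Old pricing sheet", date(2022, 3, 1)),
]
candidates = retrieval_candidates(docs, today=date(2025, 10, 6))
print([d.doc_id for d in candidates])
```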

Semantic enrichment as an AI input control

Semantic enrichment adds structure to raw files by attaching business meaning, categories, and relationships that downstream systems can use. In this model, the content does not change, but the AI input layer becomes more usable because search, routing, and embeddings operate on governed metadata rather than ad hoc file text. That is why enrichment pipelines matter: they create a reusable control point for AI readiness across many use cases instead of one-off manual preparation.

Practical implication: standardise enrichment before AI consumption, not after bad outputs appear.
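To make the idea concrete, here is a minimal sketch of an enrichment step that attaches a category, tags, and relationships while leaving the original content untouched. The keyword rules are a hypothetical stand-in for whatever classifier or enrichment service an organisation actually uses, and the `EnrichedDocument` fields are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnrichedDocument:
    """Original content plus governed metadata (illustrative fields)."""
    doc_id: str
    content: str                 # original text, carried through unchanged
    category: str
    tags: tuple[str, ...]
    related_ids: tuple[str, ...] = ()


# Hypothetical keyword rules; a real pipeline would use a trained
# classifier or a catalogue-driven taxonomy instead.
CATEGORY_RULES = {
    "contract": ("agreement", "termination", "liability"),
    "transcript": ("speaker", "call recording"),
}


def enrich(doc_id: str, content: str, related_ids=()) -> EnrichedDocument:
    lowered = content.lower()
    category = "uncategorised"
    tags: list[str] = []
    for name, keywords in CATEGORY_RULES.items():
        hits = [kw for kw in keywords if kw in lowered]
        if hits and category == "uncategorised":
            category = name  # first matching rule decides the category
        tags.extend(hits)
    return EnrichedDocument(
        doc_id, content, category, tuple(sorted(set(tags))), tuple(related_ids)
    )


text = "This Agreement covers liability and termination terms."
enriched = enrich("c-1", text, related_ids=["c-0"])
print(enriched.category, enriched.tags)
```

Because the record is frozen and carries the content as-is, search and routing can rely on the metadata without any risk of the enrichment step rewriting the source text.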

Enterprise AI search and agentic workflows

High-accuracy enterprise AI search depends on the same principles as good identity governance: consistent classification, controlled access, and predictable decision inputs. When applied to agentic workflows, search is not just a convenience layer. It becomes the mechanism that decides which knowledge an AI agent can operationalise. If the repository is noisy, duplicated, or semantically weak, the workflow can surface the wrong evidence at the wrong time and scale that error through automation.

Practical implication: validate the search substrate before allowing AI agents to rely on it for action.
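One hedged way to validate a search layer before agents depend on it is to score it against a small labelled test set. The sketch below uses precision at k (the share of the top k results that are relevant); the 0.8 threshold, the test queries, and the `fake_search` function are assumptions for illustration, not recommended values.

```python
from typing import Callable


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved IDs that are labelled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def search_ready_for_agents(
    test_cases: dict[str, set[str]],
    search: Callable[[str], list[str]],
    k: int = 3,
    threshold: float = 0.8,
) -> bool:
    """True when mean precision@k over the test queries meets the threshold."""
    scores = [precision_at_k(search(q), rel, k) for q, rel in test_cases.items()]
    return bool(scores) and sum(scores) / len(scores) >= threshold


# Hypothetical search results and relevance labels for two test queries.
fake_index = {
    "refund policy": ["d1", "d2", "d9"],
    "nda template": ["d4", "d8", "d7"],
}


def fake_search(query: str) -> list[str]:
    return fake_index.get(query, [])


cases = {"refund policy": {"d1", "d2"}, "nda template": {"d4"}}
print(search_ready_for_agents(cases, fake_search))
```

Precision is only one lens; a fuller evaluation would also look at recall and at whether results stay within the agent's permitted scope.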

NHI Mgmt Group analysis

AI readiness now depends on governed context, not just data volume. Enterprise AI projects fail less because they lack information and more because they cannot reliably select the right information at runtime. When unstructured content is duplicated, stale, or untagged, the model receives weak context and produces weak outcomes. The implication is that AI governance must start with the quality of inputs, not the sophistication of the model.

Metadata has become a control surface for AI behaviour. Semantic tagging, business-specific classification, and enrichment pipelines are no longer back-office cataloguing tasks. They shape what retrieval systems surface, what agents can use, and how confidently AI can act on enterprise knowledge. That makes metadata governance part of the identity and access conversation because the system is effectively deciding which content is in scope for machine use.

Unstructured AI exposes a familiar governance gap: hidden resources without lifecycle discipline. Contracts, emails, transcripts, and presentations often sit in repositories with no clear ownership, no expiry logic, and no consistent review path. That is the same structural problem identity teams see in unmanaged access artefacts. The named concept here is governed knowledge assets: content that is curated, classified, and operationally trustworthy enough for AI to consume. Practitioners should treat this as a prerequisite, not an optimisation.

Agentic workflows turn content governance into an execution risk. A human user can sometimes spot a stale document or a bad source, but an agent will faithfully operationalise whatever the retrieval layer presents. That means content quality failures can become action failures at machine speed. The practical conclusion is that enterprises need a governance model for what AI is permitted to retrieve, not just what it is permitted to see.

The market is moving toward AI governance that spans data, identity, and workflow control. Posts about unstructured data readiness are really about whether the enterprise can create dependable machine inputs at scale. That pushes security and IAM teams closer to data governance, because agentic systems do not stop at access control. They convert governed information into operational decisions, which raises the bar for traceability and accountability.

From our research:
98% of companies plan to deploy even more AI agents within the next 12 months, despite documented rogue behaviour in 80% of current deployments, according to the AI Agents: The New Attack Surface report.
Only 52% of companies can track and audit the data their AI agents access, leaving 48% with a complete blind spot for compliance and breach investigation.
For a forward view on how these governance gaps evolve, see Ultimate Guide to NHIs, 2025 Outlook and Predictions.

What this signals

Governed context is becoming the control plane for enterprise AI. As organisations move from experimentation to operational use, the question shifts from whether AI can consume content to whether it can consume the right content with enough traceability to support audit, incident review, and policy enforcement. The teams that treat enrichment as infrastructure will have a better starting point than those that treat it as a data-cleanup task.

AI content readiness will converge with identity governance practice. The same discipline used for entitlement ownership and review should be applied to knowledge sources that feed machine workflows. If a repository cannot be assigned an owner, a review cadence, and a clear scope of use, it should not be treated as a dependable AI input.
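The three conditions named here (an owner, a review cadence, and a clear scope of use) can be expressed as a repository-level check. This is a sketch under stated assumptions: the `Repository` fields and the use-case labels are hypothetical and would map onto whatever catalogue an organisation already maintains.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class Repository:
    """Hypothetical catalogue entry for a content repository."""
    name: str
    owner: Optional[str]
    review_cadence_days: Optional[int]
    last_review: Optional[date]
    approved_uses: frozenset[str]


def readiness_gaps(repo: Repository, use_case: str, today: date) -> list[str]:
    """Return the reasons a repository is not a dependable AI input."""
    gaps: list[str] = []
    if not repo.owner:
        gaps.append("no owner assigned")
    if not repo.review_cadence_days:
        gaps.append("no review cadence")
    elif repo.last_review is None or (
        today - repo.last_review > timedelta(days=repo.review_cadence_days)
    ):
        gaps.append("review overdue")
    if use_case not in repo.approved_uses:
        gaps.append(f"use case '{use_case}' not in scope")
    return gaps


today = date(2025, 10, 6)
contracts = Repository(
    "contracts", "legal-ops", 90, date(2025, 9, 1), frozenset({"contract-qa"})
)
shared_drive = Repository("shared-drive", None, None, None, frozenset())
print(readiness_gaps(contracts, "contract-qa", today))
print(readiness_gaps(shared_drive, "contract-qa", today))
```

An empty list means the repository passes all three checks for that use case; anything else is a reason to keep it out of production AI pipelines until the gap is closed.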

Metadata is the difference between retrieval and uncontrolled reuse. In agentic environments, search is not passive. It determines what information can influence action. That makes structured metadata and controlled retrieval pathways a prerequisite for scaling AI with confidence, especially where knowledge workers and automated agents share the same content estate.

For practitioners

Map AI use cases to content classes: Identify which repositories feed GenAI, RAG, and agentic workflows, then separate high-value governed sources from noisy or duplicated content. Prioritise contracts, transcripts, emails, and presentations because those usually carry the most semantic ambiguity.
Define metadata standards for machine consumption: Require business-specific tags, ownership, and lifecycle fields before content enters AI pipelines. Treat inconsistent metadata as a blocker for production use, not a cleanup task for later.
Introduce review logic for stale or duplicated sources: Build periodic validation for content freshness, duplication, and source authority so AI systems do not keep retrieving outdated material. Tie this to knowledge ownership rather than ad hoc file management.
Separate search enablement from action enablement: Allow AI systems to retrieve governed knowledge only after the search layer has been tested for precision, relevance, and scope. Do not permit downstream automation to depend on content that has not been validated for retrieval quality.
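The metadata-standard step above can be sketched as a gate at the entrance to an AI pipeline. The required field names (`tags`, `owner`, `review_by`) and the sample documents are hypothetical; the point is only that documents with missing metadata are blocked with an explicit reason rather than ingested and cleaned up later.

```python
# Hypothetical metadata standard: every document must carry these fields.
REQUIRED_FIELDS = ("tags", "owner", "review_by")


def missing_fields(metadata: dict) -> list[str]:
    """List required fields that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not metadata.get(name)]


def admit_to_pipeline(
    documents: dict[str, dict],
) -> tuple[list[str], dict[str, list[str]]]:
    """Split documents into admitted IDs and blocked IDs with their gaps."""
    admitted: list[str] = []
    blocked: dict[str, list[str]] = {}
    for doc_id, metadata in documents.items():
        gaps = missing_fields(metadata)
        if gaps:
            blocked[doc_id] = gaps
        else:
            admitted.append(doc_id)
    return admitted, blocked


documents = {
    "msa-2024": {"tags": ["contract"], "owner": "legal-ops", "review_by": "2026-01-31"},
    "deck-old": {"tags": [], "owner": "sales"},
}
admitted, blocked = admit_to_pipeline(documents)
print(admitted)
print(blocked)
```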

Key takeaways

Unstructured enterprise content becomes an AI liability when it lacks metadata, ownership, and review discipline.
The operational risk is not only hallucination, but also brittle retrieval that scales bad context into machine action.
Identity and data governance teams should align on governed knowledge assets before expanding GenAI and agentic use cases.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

NIST AI RMF, NIST CSF 2.0 and NIST Zero Trust (SP 800-207) set the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
NIST AI RMF		AI input governance maps to managing trustworthy AI data and context.
NIST CSF 2.0	PR.DS-01	Data management and protection underpin governed AI-ready content.
NIST Zero Trust (SP 800-207)	PR.AC-4 (a NIST CSF 1.1 identifier)	Controlled content access is essential when AI systems retrieve from enterprise repositories.

Ensure AI source repositories have defined protection, integrity, and lifecycle controls before production use.

Key terms

Governed Knowledge Asset: A governed knowledge asset is unstructured content that has been classified, enriched, and assigned ownership so it can be used reliably by people or machines. In AI programmes, it is content with enough metadata, control, and accountability to support accurate retrieval and auditability.
Semantic Metadata: Semantic metadata is business meaning attached to content so systems can interpret what a file is, why it matters, and how it should be used. It goes beyond filename or folder structure by giving AI and search systems context for routing, ranking, and retrieval decisions.
AI Input Governance: AI input governance is the discipline of controlling which data sources an AI system may consume and under what conditions. It covers source quality, ownership, freshness, tagging, and lifecycle discipline so the model is trained on, or prompted with, trusted and explainable content rather than raw repositories.

Deepen your knowledge

NHI governance, agentic AI identity, and machine identity security are core topics in our NHI Foundation Level course, the industry's only accredited NHI security programme. If you are building or maturing an identity security programme, it is worth exploring.

This post draws on content published by Collibra: Making unstructured data AI ready: Unlocking value for GenAI and agents. Read the original.

NHIMG Editorial Note
Published by the NHIMG editorial team on 2025-10-06.
NHI Mgmt Group — the independent authority on Non-Human Identity, IAM, and Agentic AI security. nhimg.org