By NHI Mgmt Group Editorial Team
Published 2026-02-19
Domain: Governance & Risk
Source: Collibra

TL;DR: According to Collibra, AI governance breaks when organisations treat data selection as an afterthought, because low-quality, poorly contextualised, or noncompliant data drives confident but flawed model outputs. The real governance risk starts upstream, where data is chosen, classified, and understood, before deployment hardens bad assumptions into automated decisions.


At a glance

What this is: Collibra's argument that AI governance starts with disciplined data understanding, not deployment controls, because data is the source of an AI system's behaviour.

Why it matters: IAM and governance teams face the same control gap across AI, NHI, and human identity programmes when ownership, context, and policy do not travel with the asset.


Context

AI governance often fails because teams optimise for model build speed before they decide whether the underlying data is suitable, lawful, and understood. In practice, that means the governance programme starts at deployment when the real risk has already been encoded in the inputs.

For IAM and security leaders, this is not just a data quality issue. The same pattern appears whenever an organisation cannot explain who owns an identity, what context surrounds its access, or whether policy follows the asset through its lifecycle.

Collibra frames this as a curation problem rather than an accumulation problem. That framing is useful because it aligns AI governance with the broader identity discipline of knowing what you have, why it exists, and who is accountable for its use.


Key questions

Q: How should security teams govern the data used for AI models?

A: Security teams should govern AI data the same way they govern high-risk identity assets: inventory it, assign ownership, classify sensitivity, and require approval before use. The key is to verify relevance, quality, context, and permission as separate checks. If any one of those checks is skipped, the model may produce credible but unsafe outcomes.

Q: Why does data context matter so much in AI governance?

A: Data context matters because AI systems learn patterns from the dataset, not just the field values. Without context, teams cannot tell whether a record is current, representative, permitted, or misleading. That creates a governance gap where the model appears accurate while actually embedding business, legal, or ethical errors.

Q: What do organisations get wrong about responsible AI governance?

A: A common mistake is assuming governance can begin after deployment. In reality, the model has already absorbed the shape of the data by then. Responsible AI requires upstream judgment about whether the dataset is suitable for the use case, and whether its use is permitted and defensible.

Q: How do teams know if their AI data governance is working?

A: It is working when teams can quickly answer who owns the data, why it is being used, whether it is suitable, and what policy restrictions apply. If those answers require manual reconstruction, the governance model is fragmented and the AI programme is operating on weak control foundations.


Technical breakdown

Why data becomes the behaviour source in AI systems

In traditional software, data is an input that can often be corrected after the fact. In AI systems, especially those used for classification or generation, data shapes the model's behaviour before an operator sees the result. That means provenance, labels, freshness, and context are not housekeeping details. They determine whether the system learns useful patterns or amplifies weak ones. If the source set is biased, stale, or incomplete, the model can produce outputs that sound credible while remaining wrong. This is why governance has to move upstream: once the model is trained or tuned, the bad assumptions are already embedded.

Practical implication: govern data selection before model training, not after deployment.

Data quality, context, and compliance are separate control problems

Teams often talk about data quality as if it were a single condition, but AI governance depends on at least four distinct checks. Relevance asks whether the dataset fits the use case. Quality asks whether it is accurate, complete, and current. Context asks where it came from, how it has been used, and what it means. Compliance and ethics ask whether the use is permitted at all. A dataset can pass one of these checks and still fail another. That is why treating data as simply 'available' is a governance error. Availability is not suitability, and suitability is not permission.

Practical implication: assess relevance, quality, context, and permission as separate approval gates.
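
The four gates can be sketched as independent checks on a review record. This is a minimal illustration, not Collibra's method or any product's API: the `DatasetReview` fields, the `Gate` names, and the example dataset are all assumptions made for the sketch.

```python
from dataclasses import dataclass
from enum import Enum


class Gate(Enum):
    RELEVANCE = "relevance"
    QUALITY = "quality"
    CONTEXT = "context"
    PERMISSION = "permission"


@dataclass
class DatasetReview:
    """Hypothetical review record; field names are illustrative only."""
    name: str
    fits_use_case: bool          # relevance: does the data fit the use case?
    accurate_and_current: bool   # quality: accurate, complete, and current?
    provenance_documented: bool  # context: origin, prior use, and meaning known?
    use_permitted: bool          # compliance and ethics: is this use allowed?


def failed_gates(review: DatasetReview) -> list[Gate]:
    """Evaluate each gate on its own and return every gate that fails."""
    results = {
        Gate.RELEVANCE: review.fits_use_case,
        Gate.QUALITY: review.accurate_and_current,
        Gate.CONTEXT: review.provenance_documented,
        Gate.PERMISSION: review.use_permitted,
    }
    return [gate for gate, passed in results.items() if not passed]


def approved_for_training(review: DatasetReview) -> bool:
    """A dataset is approved only when no gate fails."""
    return not failed_gates(review)


# Assumed example: a dataset that is relevant, clean, and documented,
# but whose use for model training has not been permitted.
crm_export = DatasetReview("crm_export", True, True, True, False)
print(failed_gates(crm_export), approved_for_training(crm_export))
```

Returning the list of failed gates, rather than a single pass/fail flag, keeps the distinction the section draws: a dataset that is available and suitable can still be blocked on permission alone.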

Why unified governance matters for AI readiness

The article's core operational point is that fragmented ownership makes disciplined curation impossible. When data lives across disconnected systems, no one has a reliable view of what exists, who owns it, or whether policies have followed it. That problem is familiar to identity teams: access cannot be governed consistently when the inventory is incomplete and the context is scattered. Unified governance does not just centralise control. It makes stewardship repeatable by attaching business meaning to data assets and preserving that meaning as the data moves.

Practical implication: build a single governance view that links data inventory, ownership, and policy enforcement.
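
One way to picture a single governance view is an inventory record that carries ownership and policy alongside the dataset itself, plus a report of what is missing. The sketch below is illustrative only; the `DatasetRecord` fields, the gap labels, and the sample entries are assumptions, not a schema from the article.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DatasetRecord:
    """Hypothetical inventory entry linking a dataset to owner and policy."""
    dataset_id: str
    source_system: str
    owner: Optional[str] = None
    classification: Optional[str] = None
    policy_ids: list[str] = field(default_factory=list)


def governance_gaps(record: DatasetRecord) -> list[str]:
    """List the governance attributes that are missing from a record."""
    gaps = []
    if not record.owner:
        gaps.append("no owner")
    if not record.classification:
        gaps.append("unclassified")
    if not record.policy_ids:
        gaps.append("no policy attached")
    return gaps


def inventory_report(inventory: list[DatasetRecord]) -> dict[str, list[str]]:
    """Map each dataset with at least one gap to its list of gaps."""
    report = {}
    for record in inventory:
        gaps = governance_gaps(record)
        if gaps:
            report[record.dataset_id] = gaps
    return report


# Assumed sample inventory: one governed dataset, one with no context at all.
inventory = [
    DatasetRecord("support_tickets", "helpdesk", owner="support-ops",
                  classification="internal", policy_ids=["retention-1y"]),
    DatasetRecord("web_scrape_2023", "crawler"),
]
print(inventory_report(inventory))
```

A dataset that appears in the report would not be treated as ready for AI use until its gaps are closed, which mirrors how identity teams handle accounts with no assigned owner.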


NHI Mgmt Group analysis

Data curation is becoming the identity governance problem that AI forces organisations to confront first. The article is right to frame curation, its step two, as the point where responsible AI either takes root or collapses, because the same question appears in identity programmes: what exactly is being governed, and does the organisation understand it well enough to trust it? When data is the behaviour source, weak curation becomes a control failure, not a documentation issue. Practitioners should treat AI data selection as a governance boundary, not a procurement detail.

The assumption that data can be fixed later was built for conventional software, not AI. That assumption fails when models learn from the dataset itself and then operate at scale with high confidence. The implication is not just better cleansing, but a rethink of how AI governance sequences review, approval, and accountability before automation hardens the wrong pattern.

Curated data is the AI equivalent of lifecycle-managed identity inventory. If you do not know what data you have, who owns it, and where policy applies, you cannot claim control any more than you can claim identity governance without inventory and recertification. The practitioner conclusion is straightforward: governance maturity starts with visibility and context, then moves to enforcement.

Unified governance is the difference between isolated checks and defensible AI operations. The article's emphasis on shared visibility and consistent policy maps directly to how identity teams avoid fragmented entitlement decisions. In both domains, the control problem is not a lack of cleverness. It is a lack of a single, durable source of governance truth. Practitioners should align AI governance with identity governance operating models, not bolt it on separately.

Data confidence is a useful concept, but it only works when confidence is earned through evidence. The article's argument that people need to know which data they can use, why they can use it, and how it should be used is the right standard. That is the same threshold identity programmes should apply to entitlements, secrets, and machine identities. Practitioners should measure governance by explainability and ownership, not by volume of data under management.

From our research:

  • 70% of organisations grant AI systems more access than they would give a human employee performing the exact same job, according to The 2026 Infrastructure Identity Survey.
  • Only 44% of organisations have implemented any policies to manage their AI agents, despite 92% agreeing that governing AI agents is critical to enterprise security.
  • That gap matters because The State of Non-Human Identity Security shows that only 1.5 in 10 organisations are highly confident in securing NHIs. Low confidence combined with thin policy coverage is why governance has to move upstream.

What this signals

Data confidence is now a governance signal, not a data-team vanity metric. When organisations cannot explain why a dataset is suitable, they end up hard-coding uncertainty into AI behaviour. The practical shift for readers is to treat data provenance, ownership, and allowed use as first-class controls in AI and identity governance, not as metadata chores.

With 85% of organisations lacking full visibility into third-party vendors connected via OAuth apps, the State of Non-Human Identity Security shows how quickly governance breaks when context does not travel with the asset. The same lesson applies to AI data: if ownership and policy are fragmented, operational confidence is mostly theatre.

Curated inputs, not bigger data estates, are what will separate durable AI programmes from fragile ones. Readers should expect more pressure to prove lineage, permitted use, and stewardship across model pipelines. The governance model that wins is the one that can explain every critical input without reconstruction.


For practitioners

  • Separate data approval from model approval: Require explicit review of relevance, quality, context, and permitted use before any dataset reaches training or tuning. Make the data decision auditable so teams cannot inherit unexamined assumptions from earlier project stages.
  • Create a governed data inventory with ownership attached: Track the source, purpose, business owner, sensitivity, and downstream consumers for each dataset used in AI workflows. If ownership is unclear, the dataset should not be considered ready for production use.
  • Treat policy propagation as a lifecycle control: Verify that classification, retention, and usage restrictions stay attached as data moves between platforms, teams, and model pipelines. This is the data equivalent of ensuring identity context follows access across environments.
  • Build cross-functional governance checkpoints: Bring data owners, security, privacy, and AI builders into the same review path so no team can approve suitability in isolation. Shared approval is what turns curation from an aspiration into an operating model.
  • Measure governance by explainability and exception rate: Track how often teams can explain data lineage, permitted use, and ownership without manual reconstruction. Rising exceptions are a sign that governance is fragmented and that AI decisions are being made on weak inputs.
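
Two of the practices above, policy propagation and exception-rate tracking, lend themselves to a small sketch. The tag names, the dictionary representation of restrictions, and both helper functions are hypothetical and exist only to illustrate the checks; a real programme would read these values from its own catalogue or pipeline metadata.

```python
def dropped_restrictions(source: dict[str, str], copy: dict[str, str]) -> dict[str, str]:
    """Return source restrictions that are missing or altered on a downstream copy."""
    return {key: value for key, value in source.items() if copy.get(key) != value}


def exception_rate(explainable: list[bool]) -> float:
    """Share of datasets whose lineage, permitted use, or ownership
    could not be explained without manual reconstruction."""
    if not explainable:
        return 0.0
    return explainable.count(False) / len(explainable)


# Assumed example: the usage restriction was lost when the data was copied
# into a model pipeline.
source_tags = {"classification": "confidential", "retention": "1y", "usage": "no-training"}
pipeline_tags = {"classification": "confidential", "retention": "1y"}
print(dropped_restrictions(source_tags, pipeline_tags))

# Assumed example: four datasets reviewed, one could not be explained.
print(exception_rate([True, True, False, True]))
```

A non-empty result from the first check, or a rising value from the second, is the kind of signal the list describes: context has stopped travelling with the asset.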

Key takeaways

  • AI governance fails early when organisations treat data as a build input instead of the source of model behaviour.
  • Relevance, quality, context, and permission are separate control problems, and each one can fail independently.
  • Teams that want defensible AI need a governed data inventory with ownership, policy, and lineage attached.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

NIST AI RMF, NIST CSF 2.0 and NIST Zero Trust (SP 800-207) provide the governance and control guidance most relevant to the practices described here.

Framework                    | Control / Reference | Relevance
NIST AI RMF                  | -                   | AI governance depends on upstream data suitability and accountability.
NIST CSF 2.0                 | GV.OV-01            | Governance and oversight are needed for AI data selection and ownership.
NIST Zero Trust (SP 800-207) | PR.AA-01            | Trustworthy access decisions depend on context that follows the asset.

Apply AI RMF GOVERN and MAP functions to validate data provenance, ownership, and permitted use before model work.


Key terms

  • Data Curation: The disciplined selection, review, and approval of data for a specific purpose. In AI governance, curation means more than cleaning records. It requires judging relevance, quality, context, and permitted use before data is allowed to shape model behaviour.
  • Data Provenance: The origin and history of a dataset, including where it came from, how it was transformed, and who owns it. Provenance matters because AI systems can amplify hidden defects when teams cannot explain the source or reliability of the input data.
  • Responsible AI: An operating model that ensures AI is used in a way that is explainable, lawful, and aligned to business intent. In practice, it depends on upstream data governance, documented model purpose, and ongoing verification rather than deployment-time controls alone.
  • Data Confidence: A governance outcome where teams know which data they can use, why they can use it, and how it should be used. It is not a feeling. It reflects traceable ownership, clear policy, and evidence that the data is suitable for the AI use case.

This post draws on content published by Collibra: "The AI connoisseur. Curating high-quality data for responsible innovation."
