What do organisations get wrong about AI observability?

Why This Matters for Security Teams

AI observability is often sold as if better dashboards automatically mean better control, but telemetry alone does not establish policy compliance, lineage, or accountability. Security teams need to distinguish operational visibility from governance evidence: a model can be fast and stable and still process prohibited data, call the wrong tool, or expose secrets. That gap is why the NIST Cybersecurity Framework 2.0 matters here: it frames visibility as part of a broader control system rather than an end state.

The practical risk is that teams over-trust monitoring outputs that show only health metrics, while attackers and misconfigurations operate in the control plane, prompt path, or tool chain. NHIMG research on the DeepSeek breach illustrates how exposed data and embedded secrets can sit outside what standard dashboards surface, even when systems appear normal. The same pattern appears in secret sprawl and delayed remediation, as highlighted in The State of Secrets in AppSec. In practice, many security teams discover AI governance gaps only after a prompt, retrieval, or tool action has already crossed an approval boundary.

How It Works in Practice

Effective AI observability should capture what the system did, why it did it, and whether that action was allowed. That means logging more than latency and token counts. It means recording prompt inputs, retrieval sources, tool calls, model version, policy decisions, confidence thresholds, and the identity of the workload or agent that initiated the action. Current guidance suggests treating these records as evidence, not just debugging data.
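As an illustration of the fields listed above, a single decision could be captured in one structured record. This is a minimal sketch, not a standard schema: the class name `DecisionRecord`, its field names, and the sample values are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class DecisionRecord:
    """One AI decision, recorded as evidence (hypothetical schema)."""
    workload_id: str                    # non-human identity that acted
    model_version: str
    prompt: str
    retrieval_sources: tuple[str, ...]  # documents or indexes consulted
    tool_calls: tuple[str, ...]         # tools invoked for this decision
    policy_id: str                      # policy that was evaluated
    policy_decision: str                # "allow" or "deny"
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        # Sorted keys keep the serialised form stable for later comparison.
        return json.dumps(asdict(self), sort_keys=True)


record = DecisionRecord(
    workload_id="agent-billing-01",
    model_version="model-2024-06",
    prompt="Summarise the open invoices for this account",
    retrieval_sources=("kb://invoices/policy.md",),
    tool_calls=("crm.lookup",),
    policy_id="pol-tool-allowlist",
    policy_decision="allow",
)
print(record.to_json())
```

Freezing the dataclass is a small design choice that discourages in-process mutation after the record is created; it does not replace storage-level integrity controls.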

A useful operating model is to separate three layers:

Infrastructure telemetry: uptime, errors, throughput, resource use.

Decision telemetry: prompts, retrieved context, tool invocation, policy evaluation, human approvals.

Governance evidence: immutable audit trails, retention controls, approval chains, and exception handling.
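One way to approximate the audit-trail property of the third layer is a hash chain, where each entry commits to the one before it. The sketch below is an assumption about how that could look, with hypothetical function names; it makes tampering detectable rather than impossible, so it would still need write-once storage or external anchoring behind it.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash for the first entry


def _digest(prev_hash: str, event: dict) -> str:
    payload = json.dumps(event, sort_keys=True)
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_entry(chain: list, event: dict) -> dict:
    """Append an event whose hash covers the previous entry's hash."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    entry = {
        "event": event,
        "prev_hash": prev_hash,
        "hash": _digest(prev_hash, event),
    }
    chain.append(entry)
    return entry


def verify_chain(chain: list) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True
```

A verifier would typically run `verify_chain` before the trail is relied on as evidence, for example at the start of an incident review.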

This is where frameworks like NIST Cybersecurity Framework 2.0 and the NIST AI Risk Management Framework (AI RMF) become practical, because they push teams toward traceability and accountability instead of isolated monitoring. For systems driven by non-human identities (NHIs), the observability layer must also bind each action to the identity that executed it. If a model or agent uses shared credentials, the trail becomes ambiguous because it cannot prove which workload performed the action.
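The shared-credential problem can be checked directly against the logs. A minimal sketch, assuming each event already carries hypothetical `credential_id` and `workload_id` fields:

```python
from collections import defaultdict


def find_shared_credentials(events: list) -> dict:
    """Return credentials that were used by more than one workload."""
    workloads_by_credential = defaultdict(set)
    for event in events:
        workloads_by_credential[event["credential_id"]].add(event["workload_id"])
    return {
        credential: sorted(workloads)
        for credential, workloads in workloads_by_credential.items()
        if len(workloads) > 1
    }
```

Any credential returned here marks a point where the trail cannot attribute an action to a single workload.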

NHIMG’s The State of Secrets in AppSec also shows why secrets handling belongs inside observability design. If secret exposure is detected late, the logs must show when the secret was accessed, where it was used, and whether rotation or revocation followed. That is the difference between operational monitoring and defensible governance evidence. These controls tend to break down in environments with shared service accounts, weak log retention, or fragmented AI tooling because the action trail cannot be reconstructed end to end.
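The three questions above (when the secret was accessed, where it was used, and whether rotation or revocation followed) can be answered from the logs only if secret events are recorded consistently. The following sketch assumes a hypothetical event shape with `secret_id`, `workload_id`, `action`, and a `datetime` field `at`:

```python
from datetime import datetime


def summarise_secret_exposure(events: list, secret_id: str, detected_at: datetime) -> dict:
    """Reconstruct access and remediation history for one secret."""
    relevant = sorted(
        (e for e in events if e["secret_id"] == secret_id),
        key=lambda e: e["at"],
    )
    accesses = [e for e in relevant if e["action"] == "access"]
    # Only rotation or revocation at or after detection counts as remediation.
    remediation = [
        e for e in relevant
        if e["action"] in ("rotate", "revoke") and e["at"] >= detected_at
    ]
    return {
        "first_access": accesses[0]["at"] if accesses else None,
        "used_by": sorted({e["workload_id"] for e in accesses}),
        "remediated": bool(remediation),
        "remediated_at": remediation[0]["at"] if remediation else None,
    }
```

If `used_by` comes back empty for a secret known to be exposed, that usually points to a logging gap rather than an absence of use.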

Common Variations and Edge Cases

Tighter observability often increases storage, privacy, and engineering overhead, requiring organisations to balance richer evidence against data minimisation and response speed. Not every environment can or should retain full prompts indefinitely, so current guidance suggests risk-based logging with explicit redaction and access controls.
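A rough sketch of risk-based logging with redaction follows. It assumes a single email pattern as the only redaction rule and a keyed fingerprint so identical prompts can still be matched later; real deployments would need a broader rule set, and the function names are hypothetical.

```python
import hashlib
import hmac
import re

# Deliberately simple pattern; a real redaction policy would cover more data types.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(text: str) -> str:
    """Replace email addresses with a fixed placeholder."""
    return EMAIL_PATTERN.sub("[REDACTED_EMAIL]", text)


def fingerprint(text: str, key: bytes) -> str:
    """Keyed hash: supports matching identical prompts without storing them."""
    return hmac.new(key, text.encode("utf-8"), hashlib.sha256).hexdigest()


def minimised_log_entry(prompt: str, key: bytes, retain_text: bool) -> dict:
    """Always keep the fingerprint; keep redacted text only when risk justifies it."""
    entry = {"prompt_hmac": fingerprint(prompt, key)}
    if retain_text:
        entry["prompt_redacted"] = redact(prompt)
    return entry
```

Using an HMAC rather than a bare hash means someone holding only the logs cannot confirm a guessed prompt without the key.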

The main edge case is regulated or high-volume AI where full capture is technically possible but operationally expensive. In those settings, teams often sample logs, hash sensitive inputs, or store only policy-relevant fields. That can work, but best practice is evolving and there is no universal standard for this yet. Another common failure mode is multi-agent orchestration: a single user request may trigger several downstream tools, making one dashboard line item useless unless it is correlated across the entire chain.
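For the multi-agent case, correlation usually depends on a shared identifier propagated through every downstream call. The sketch below assumes each event carries a hypothetical `trace_id` and a sequence number `seq`; neither field name comes from a specific tracing product.

```python
from collections import defaultdict


def group_by_trace(events: list) -> dict:
    """Rebuild each request's chain of steps from a flat event stream."""
    traces = defaultdict(list)
    for event in events:
        traces[event["trace_id"]].append(event)
    for steps in traces.values():
        steps.sort(key=lambda e: e["seq"])  # restore execution order
    return dict(traces)
```

With events grouped this way, a reviewer can read one user request as an ordered chain of agent and tool steps instead of unrelated dashboard rows.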

Observability also loses value when it is not tied to enforcement. If an alert says a model used disallowed data but nothing prevents recurrence, the system is only documenting failure. For that reason, security teams should align AI observability with incident response, access governance, and secret rotation, not treat it as a standalone analytics layer. Where model providers, vector stores, and tool gateways are operated by different teams, the evidence chain usually fragments and the governance story falls apart.
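Tying observability to enforcement can be as simple as making the logging path and the blocking path the same code. This is a minimal sketch under assumed names: the `data_classification` field, the `DISALLOWED_CLASSIFICATIONS` set, and the `PolicyViolation` exception are hypothetical.

```python
class PolicyViolation(Exception):
    """Raised when an action is blocked by policy."""


# Hypothetical policy: actions touching these data classes are never allowed.
DISALLOWED_CLASSIFICATIONS = {"restricted"}


def enforce_and_log(action: dict, audit_log: list) -> dict:
    """Record the decision first, then block the action if it is disallowed."""
    allowed = action["data_classification"] not in DISALLOWED_CLASSIFICATIONS
    audit_log.append({**action, "decision": "allow" if allowed else "deny"})
    if not allowed:
        raise PolicyViolation(
            f"{action['workload_id']} attempted to use "
            f"{action['data_classification']} data"
        )
    return action
```

Because the denial is logged before the exception is raised, the trail shows both the attempt and the fact that it was stopped.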

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

The OWASP Agentic AI Top 10 addresses the attack and risk surface, while NIST CSF 2.0 and NIST AI RMF set the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
NIST CSF 2.0	DE.CM-01	Observability begins with continuous monitoring, but must extend beyond uptime metrics.
NIST AI RMF		AI RMF emphasises traceability and accountability for AI outcomes and decisions.
OWASP Agentic AI Top 10		Agentic systems need auditability of prompts, tools, and autonomous actions.

Use AI RMF to define what evidence must be logged, retained, and reviewable for each AI decision.
