TL;DR: As AI-generated code and agent-assisted development speed up change, the codebase is becoming a less reliable source of truth, and observability has to close the gap, according to WorkOS's interview with Honeycomb CEO Christine Yen. The governance challenge is no longer just debugging software faster, but proving what software actually did when humans and agents both shape production behaviour.
At a glance
What this is: An interview-driven analysis of why observability becomes more important as AI agents and AI-generated code make software behaviour harder to predict.
Why it matters: Identity, access, and audit teams increasingly need evidence from production behaviour, not just code review, to govern human, non-human identity (NHI), and autonomous systems.
Context
AI-generated code and agent-assisted development change the assumption that engineers can understand a system by reading the code alone. When software is created and modified faster than people can review it, observability becomes the mechanism for proving what actually happened in production, including when humans and agents are both involved in the delivery chain.
For IAM and security teams, the governance question is not only whether the software works, but whether the operating model still provides enough evidence to assign accountability, detect drift, and validate access-dependent behaviour. That becomes more important as engineering workflows blend human decision-making with machine-generated changes.
Key questions
Q: How should teams govern AI-generated code when they cannot review every change?
A: Teams should shift from source-only assurance to runtime assurance. That means correlating deployments, traces, logs, and outcome metrics so behaviour can be validated after code is generated and released. The practical goal is not perfect review coverage, but a dependable record of what the system actually did in production.
Q: Why does observability matter more when humans and agents both change software?
A: Because the operating model becomes harder to infer from code alone. When humans and agents both shape the release path, observability provides the evidence needed to answer who changed what, what executed, and whether the outcome stayed inside policy. Without that evidence, accountability becomes ambiguous.
Q: How do teams know whether observability is working for AI-heavy systems?
A: They should look for fast conversion of production surprises into new tests, clear links between runtime events and business impact, and reliable explanation of unexpected outcomes. If anomalies can be traced to intent, execution, and effect, observability is doing its job. If not, it is only producing noise.
Q: What is the difference between evals and observability in AI operations?
A: Evals test anticipated behaviour before release. Observability shows what the system actually did under real conditions after release. Teams need both because evals are bounded by what they expected, while observability reveals the failures, edge cases, and unintended effects that only appear in production.
Technical breakdown
Why observability outruns code review in AI-built systems
Code review assumes a stable relationship between source code and runtime behaviour. AI-generated code breaks that assumption because the system can change faster than engineers can inspect it, and the result may be logically correct in review but operationally surprising in production. Observability captures runtime events, traces, metrics, and logs, which are the only reliable reference when the codebase is no longer the full explanation of behaviour. In this model, the truth lives in execution, not in the repository.
Practical implication: treat production telemetry as a control plane for understanding AI-accelerated software change.
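As a minimal sketch of what that could look like, the Python below builds one structured runtime event that carries change provenance alongside the outcome. The `RuntimeEvent` class and its field names (`deploy_id`, `change_author_kind`, `outcome`) are hypothetical illustrations, not a schema taken from the interview or from any vendor.

```python
import json
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone


@dataclass
class RuntimeEvent:
    """One event describing what executed and which change produced it (hypothetical schema)."""
    service: str
    operation: str
    deploy_id: str            # assumed identifier linking the event to a release
    change_author_kind: str   # assumed values: "human" or "agent"
    outcome: str              # e.g. "ok", "error", "off_policy"
    duration_ms: float
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        # Sorted keys keep the serialised record stable for later comparison.
        return json.dumps(asdict(self), sort_keys=True)


event = RuntimeEvent(
    service="checkout",
    operation="apply_discount",
    deploy_id="deploy-123",
    change_author_kind="agent",
    outcome="ok",
    duration_ms=41.7,
)
print(event.to_json())
```

Recording provenance on the event itself, rather than reconstructing it later, is one way to keep execution evidence usable when the repository alone no longer explains behaviour.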
Evals and observability are different control layers
Evals are pre-production tests for anticipated behaviour. Observability covers the unknown unknowns, where real users trigger conditions the test set did not model. The two are complementary because evals define expected outcomes, while observability exposes deviation in live systems. For AI systems, that feedback loop is essential: a surprising production event should become a new eval case, otherwise the organisation keeps relearning the same failure in different places.
Practical implication: wire production anomalies back into test design so governance improves with each unexpected runtime event.
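The loop from production surprise to eval case can be sketched as follows. This is an illustration under stated assumptions: the `Anomaly` and `EvalCase` shapes, the `anomaly_to_eval_case` helper, and the "must not reproduce the bad output" pass rule are all hypothetical, and a real eval harness would likely use richer assertions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Anomaly:
    """An unexpected production outcome captured from telemetry (hypothetical shape)."""
    trace_id: str
    input_text: str
    observed_output: str
    expected_behaviour: str  # written by a reviewer during triage


@dataclass(frozen=True)
class EvalCase:
    """A repeatable pre-release test derived from a production surprise."""
    name: str
    input_text: str
    must_not_equal: str      # the bad output seen in production
    description: str


def anomaly_to_eval_case(anomaly: Anomaly) -> EvalCase:
    return EvalCase(
        name=f"regression_{anomaly.trace_id}",
        input_text=anomaly.input_text,
        must_not_equal=anomaly.observed_output,
        description=anomaly.expected_behaviour,
    )


def run_eval(case: EvalCase, system: Callable[[str], str]) -> bool:
    """Pass when the system no longer reproduces the bad output."""
    return system(case.input_text) != case.must_not_equal


anomaly = Anomaly(
    trace_id="t-001",
    input_text="refund order 42",
    observed_output="refund issued twice",
    expected_behaviour="issue exactly one refund",
)
case = anomaly_to_eval_case(anomaly)


def fixed_system(text: str) -> str:
    # Stand-in for the system under test after a fix.
    return "refund issued once"


print(case.name, run_eval(case, fixed_system))
```

Keeping the originating `trace_id` in the eval name preserves the link between the test and the runtime evidence that motivated it.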
Business-defined quality signals matter more in non-deterministic systems
Non-deterministic software cannot be judged only by infrastructure health. A service can return successfully and still create business harm if the output is wrong, unsafe, or off-policy. That is why subjective quality measures, such as whether a user streak is at risk, become operationally important. For AI-heavy systems, observability has to track whether the outcome is acceptable to the business, not merely whether the request completed.
Practical implication: define observable business-impact signals before non-deterministic behaviour becomes too complex to interpret.
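To make the distinction concrete, here is a small sketch that separates infrastructure health from business outcome, using the user-streak example mentioned above. The `Response` fields and the rule that a completed request must never shorten a streak are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Response:
    status_code: int
    streak_days_before: int   # hypothetical business field
    streak_days_after: int    # hypothetical business field


def infrastructure_healthy(resp: Response) -> bool:
    return 200 <= resp.status_code < 300


def business_outcome_acceptable(resp: Response) -> bool:
    # Assumed rule: a completed request must never shorten a user's streak.
    return resp.streak_days_after >= resp.streak_days_before


def classify(resp: Response) -> str:
    if not infrastructure_healthy(resp):
        return "infrastructure_failure"
    if not business_outcome_acceptable(resp):
        return "silent_business_harm"
    return "ok"


print(classify(Response(200, 30, 0)))    # silent_business_harm
print(classify(Response(200, 30, 31)))   # ok
print(classify(Response(503, 30, 30)))   # infrastructure_failure
```

The first case is the one infrastructure dashboards miss: the request succeeded, yet the outcome is one the business would reject.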
NHI Mgmt Group analysis
AI-generated software creates an observability gap before it creates a debugging problem. The core failure is not that teams lack more dashboards. The failure is that code review no longer fully describes runtime behaviour when machines can generate and modify logic faster than humans can validate it. Practitioners should treat this as a governance shift, where execution evidence becomes more authoritative than source artefacts.
The old assumption that the codebase is the source of truth is already collapsing. That assumption was designed for systems where humans could reasonably inspect change before deployment. It fails when software is generated and iterated at machine speed because no team can read enough of the code to explain all operational outcomes. The implication is that accountability has to move toward runtime evidence and production observability.
Runtime truth over source truth is a useful name for this shift in AI-accelerated software operations. The system's real behaviour increasingly lives in traces, logs, metrics, and outcome signals, not in the repository alone. That changes how security, engineering, and identity teams validate control effectiveness across human and machine actors. Practitioners should formalise runtime evidence as the basis for trust decisions.
Observability is becoming a cross-functional identity control, not just an engineering tool. Once humans and agents both contribute to software change, the questions become who changed what, what actually executed, and whether the resulting behaviour stayed within policy. That makes observability relevant to access governance, incident triage, and release accountability. Security leaders should align operational telemetry with identity and approval records.
AI development pipelines expose a new governance boundary between intent and execution. Evals capture intended behaviour, while observability captures what the system did under real conditions. The organisations that close that loop will be able to govern non-deterministic software more credibly than those that rely on static review alone. Practitioners should design processes around that boundary.
From our research:
- 33% of organisations report that their AI agents have accessed inappropriate or sensitive data beyond their intended scope, according to the AI Agents: The New Attack Surface report.
- Only 52% of companies can track and audit the data their AI agents access, leaving 48% with a complete blind spot for compliance and breach investigation.
- For a broader control model, compare that with OWASP Agentic AI Top 10 and treat runtime evidence as part of governance, not just troubleshooting.
What this signals
Runtime proof will become part of the control stack: as AI-generated code expands, teams will need stronger links between execution evidence and governance records. That means observability data starts to matter to security architects, not just SREs, because it becomes the only reliable way to explain what software did when code review could not keep up.
The next programme risk is not simply more complexity, but more unreviewed variation. Honeycomb's framing points to a world where AI-assisted development makes behaviour drift faster than governance cycles, so teams should expect more demand for evidence-based controls, business-impact telemetry, and clearer accountability across human and machine contributors.
For practitioners
- Re-anchor assurance in runtime evidence: Map critical production decisions to traces, logs, and outcome metrics so reviewers can explain what the system actually did, not only what was committed.
- Turn production surprises into new eval cases: Create a formal loop that converts unexpected live behaviour into repeatable tests, so each anomaly improves the next review cycle.
- Define business-impact signals for non-deterministic services: Identify the user or revenue outcomes that matter most, then instrument them so a successful API call that causes harm is still visible.
- Tie change records to execution records: Correlate deployment, access, and automation activity with runtime events so human and machine contributions can be assigned cleanly.
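The last of these points, correlating change records with execution records, might be sketched as below. The record shapes, identifiers, and the `attribute` helper are hypothetical; in practice the two data sets would come from a deployment system and a telemetry store rather than in-memory literals.

```python
from collections import Counter

# Hypothetical change records, e.g. exported from a deployment system.
deployments = {
    "deploy-101": {"author": "alice", "author_kind": "human"},
    "deploy-102": {"author": "codegen-bot", "author_kind": "agent"},
}

# Hypothetical runtime events, each tagged with the deploy that served it.
runtime_events = [
    {"trace_id": "t1", "deploy_id": "deploy-101", "outcome": "ok"},
    {"trace_id": "t2", "deploy_id": "deploy-102", "outcome": "off_policy"},
    {"trace_id": "t3", "deploy_id": "deploy-102", "outcome": "ok"},
    {"trace_id": "t4", "deploy_id": "deploy-999", "outcome": "off_policy"},
]


def attribute(events, deploys):
    """Join each runtime event to its change record; flag events with no record."""
    joined = []
    for event in events:
        change = deploys.get(event["deploy_id"])
        joined.append({
            **event,
            "author": change["author"] if change else None,
            "author_kind": change["author_kind"] if change else "unattributed",
        })
    return joined


attributed = attribute(runtime_events, deployments)
off_policy_by_kind = Counter(
    e["author_kind"] for e in attributed if e["outcome"] == "off_policy"
)
print(dict(off_policy_by_kind))  # {'agent': 1, 'unattributed': 1}
```

Surfacing the "unattributed" bucket explicitly matters for governance: an off-policy event with no matching change record is itself evidence of a gap between approval records and what ran.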
Key takeaways
- AI-generated code weakens the assumption that source code alone explains runtime behaviour, so observability becomes a governance control as much as an engineering one.
- Production telemetry matters because non-deterministic systems can appear healthy while still producing outcomes that violate intent or policy.
- Teams that connect runtime evidence to evals, identity records, and business impact will have a much stronger basis for accountability in AI-heavy environments.
Standards & Framework Alignment
This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.
The OWASP Agentic AI Top 10 addresses the attack and risk surface, while NIST CSF 2.0 and NIST AI RMF set the governance and control requirements practitioners need to meet.
| Framework | Control / Reference | Relevance |
|---|---|---|
| NIST CSF 2.0 | DE.CM-01 | Observability maps directly to continuous monitoring of system behaviour. |
| OWASP Agentic AI Top 10 | | AI-generated code and agentic workflows create runtime behaviour and oversight risks. |
| NIST AI RMF | GOVERN 1 | AI systems need governance that links intent, monitoring, and accountability. |
Assign explicit governance ownership for AI-assisted development and runtime assurance.
Key terms
- Observability: Observability is the ability to infer what a system is doing from its external outputs, such as logs, traces, and metrics. For AI-accelerated environments, it is the evidence layer that helps teams explain runtime behaviour when code review no longer captures the full story.
- Evals: Evals are structured tests used to measure whether an AI system behaves as intended under anticipated conditions. They are useful for pre-release validation, but they do not replace production monitoring because they cannot cover every real-world edge case or emerging failure mode.
- Runtime Evidence: Runtime evidence is the record of what software actually did in production, including execution traces, decisions, and outcomes. It matters when generated code, automation, or agents create behaviour that cannot be fully understood from source code alone.
- Non-Deterministic Behaviour: Non-deterministic behaviour is software behaviour that does not produce the same outcome every time under similar inputs. In AI systems, this makes traditional testing and code review incomplete unless teams also capture and analyse production outcomes.
Deepen your knowledge
NHI governance, agentic AI identity, and machine identity security are core topics in our NHI Foundation Level course, the industry's only accredited NHI security programme. If you are responsible for identity security strategy or NHI governance in your organisation, it is worth exploring.
This post draws on content published by WorkOS: Honeycomb CEO Christine Yen on why observability matters more than ever as AI agents reshape software. Read the original.
Published by the NHIMG editorial team on 2026-04-15.
NHI Mgmt Group — the independent authority on Non-Human Identity, IAM, and Agentic AI security. nhimg.org