AI evaluation is becoming core infrastructure for reliable AI products

By NHI Mgmt Group Editorial Team | Published 2026-04-15 | Domain: Best Practices | Source: WorkOS

TL;DR: Teams ship AI features quickly, but reliable validation still breaks down because probabilistic outputs do not fit deterministic software testing, according to WorkOS’ interview with Braintrust. Continuous evaluation, experiment tracking, and scored datasets are becoming the practical basis for knowing whether prompts, models, and retrieval pipelines are improving or silently regressing.

At a glance

What this is: An analysis of why AI evaluation and observability are becoming core infrastructure for production AI, with a focus on continuous testing rather than one-time validation.

Why it matters: AI systems increasingly make runtime decisions that affect access, data handling, and governance, so IAM practitioners need measurable controls over those decisions across human, NHI, and autonomous programmes.

Context

AI evaluation is the discipline of measuring whether model-driven features still behave as intended after prompts, models, or retrieval logic change. The article argues that AI products fail when teams rely on traditional software testing alone, because probabilistic outputs can be valid in more than one form and still drift over time.

For identity teams, that creates a governance problem as much as an engineering one. If an AI feature touches data access, workflow decisions, or user-facing approvals, then evaluation becomes part of control assurance, not just product quality.

The starting point described here is typical for teams moving from demos to production AI. They can build the feature, but they cannot yet prove the feature remains reliable under real-world variation.

Key questions

Q: How should security teams implement AI evaluation in production workflows?

A: Security teams should treat AI evaluation as a continuous control, not a pre-launch checklist. Build representative datasets, define scoring criteria for the outcomes that matter, and rerun tests whenever prompts, models, or retrieval logic change. That creates evidence for regression detection and release decisions instead of relying on intuition.
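One way to turn reruns into release evidence is a simple regression gate. The sketch below is illustrative only: the function name, the case identifiers, the scores, and the 0.05 tolerance are all assumptions, not details from the WorkOS interview. It compares per-case scores from a baseline run against a candidate run and blocks release when any case drops by more than the tolerance or disappears from the candidate run.

```python
def regression_gate(baseline: dict[str, float], candidate: dict[str, float],
                    tolerance: float = 0.05) -> dict:
    """Block release if any shared case drops by more than `tolerance`
    or if a baseline case was not scored in the candidate run."""
    regressions = {
        case_id: (baseline[case_id], candidate[case_id])
        for case_id in baseline.keys() & candidate.keys()
        if baseline[case_id] - candidate[case_id] > tolerance
    }
    missing = sorted(baseline.keys() - candidate.keys())
    return {
        "release_ok": not regressions and not missing,
        "regressions": regressions,
        "missing_cases": missing,
    }


# Hypothetical per-case scores before and after a prompt change.
baseline = {"approve-refund": 0.90, "deny-expired-token": 0.95, "route-to-human": 0.80}
candidate = {"approve-refund": 0.92, "deny-expired-token": 0.70, "route-to-human": 0.78}

report = regression_gate(baseline, candidate)
print(report)
```

Comparing case by case, rather than only on the mean, keeps one improved case from masking a sharp drop in another.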

Q: Why do probabilistic AI outputs complicate traditional testing?

A: Probabilistic outputs can produce multiple valid answers for the same input, so exact-match tests reject acceptable variation yet still fail to catch drift. Teams need scoring methods that measure quality, consistency, and boundary conditions over many runs, not just pass or fail on one expected string.

Q: What breaks when AI prompts are changed without evaluation?

A: The system may appear to work in a demo while silently degrading in production. Prompt changes can alter retrieval behaviour, output tone, or decision quality in ways that are hard to spot without comparison runs and scored datasets, which makes regressions harder to detect and explain.

Q: How do teams know if AI observability is actually working?

A: It is working when teams can show which change caused a quality shift, which dataset surfaced the issue, and whether the regression was contained before users were affected. If the team cannot trace behaviour across versions, observability is producing logs, not governance evidence.

Technical breakdown

Why deterministic testing fails for probabilistic AI outputs

Traditional unit tests assume a fixed input produces a fixed expected output. AI systems do not behave that way, because the same prompt can yield multiple acceptable responses depending on context, retrieval results, or model variance. That means the test target shifts from exact text matching to bounded quality measurement. Effective evaluation therefore combines datasets, scoring rules, and comparison across runs so teams can see whether a change improved the distribution of outcomes, not just one sample. In practice, this is closer to quality engineering than classical application testing.

Practical implication: define acceptance criteria that measure output quality, consistency, and regression risk instead of relying on exact-match tests.
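A minimal sketch of that shift, assuming a hypothetical rubric of required and forbidden phrases. The function names, the rubric, and the sample outputs are illustrative and are not taken from the source. It scores several sampled outputs for the same input and reports a pass rate, where an exact-match test would reject every paraphrase.

```python
from statistics import mean


def exact_match(output: str, expected: str) -> bool:
    """Classic deterministic assertion: only one string is acceptable."""
    return output.strip() == expected.strip()


def rubric_score(output: str, required: list[str], forbidden: list[str]) -> float:
    """Hypothetical scorer: fraction of required phrases present,
    or 0.0 if any forbidden phrase appears."""
    text = output.lower()
    if any(term.lower() in text for term in forbidden):
        return 0.0
    if not required:
        return 1.0
    hits = sum(1 for term in required if term.lower() in text)
    return hits / len(required)


def pass_rate(outputs: list[str], required: list[str], forbidden: list[str],
              threshold: float = 0.8) -> float:
    """Share of sampled outputs whose rubric score meets the threshold."""
    scores = [rubric_score(o, required, forbidden) for o in outputs]
    return mean(1.0 if s >= threshold else 0.0 for s in scores)


# Three sampled outputs for the same input (made up for illustration).
samples = [
    "Access denied: the user lacks the admin role.",
    "The request is denied because the admin role is missing.",
    "Access granted.",
]
expected = "Access denied: the user lacks the admin role."

exact = [exact_match(s, expected) for s in samples]
rate = pass_rate(samples, required=["denied", "admin role"], forbidden=["granted"])
print(exact, rate)
```

The second sample is a valid paraphrase that exact matching fails, while the rubric accepts it and still rejects the third, unsafe output. A real scorer would likely be richer (model-graded or human-reviewed), but the acceptance criterion has the same shape: a threshold on a distribution of scores.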

Evaluation datasets and scoring functions as control surfaces

Evaluation becomes actionable when teams curate representative datasets and apply scoring functions that can be automated, reviewed by humans, or both. A dataset anchors the evaluation to real usage patterns, while a scorer turns subjective quality into something repeatable enough for release decisions. The architectural point is that the evaluation harness sits alongside development, not after it. That makes prompt changes, model swaps, and retrieval updates observable before they reach users, which is the difference between controlled iteration and blind shipping.

Practical implication: maintain representative eval datasets for the behaviours that matter most and score them before every meaningful change.
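The dataset-plus-scorer pattern can be sketched as follows. Everything named here is an assumption for illustration: the EvalCase structure, the token-overlap scorer, and the stub model standing in for a real model call. The sketch shows an automated scorer whose result is overridden by a human review score when one has been recorded.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class EvalCase:
    case_id: str
    prompt: str
    reference: str
    human_score: Optional[float] = None  # filled in by a reviewer when available


def token_overlap(output: str, reference: str) -> float:
    """Automated scorer (an assumption): Jaccard overlap of lowercase tokens."""
    a, b = set(output.lower().split()), set(reference.lower().split())
    return len(a & b) / len(a | b) if a | b else 1.0


def run_eval(dataset: list[EvalCase], generate: Callable[[str], str],
             scorer: Callable[[str, str], float] = token_overlap) -> dict[str, float]:
    """Score every case; a recorded human score overrides the automated one."""
    results = {}
    for case in dataset:
        auto = scorer(generate(case.prompt), case.reference)
        results[case.case_id] = case.human_score if case.human_score is not None else auto
    return results


def stub_model(prompt: str) -> str:
    # Stand-in for a real model call so the sketch runs offline.
    canned = {
        "reset password": "send a reset link by email",
        "delete account": "account deleted",
    }
    return canned.get(prompt, "")


dataset = [
    EvalCase("c1", "reset password", "send a reset link by email"),
    EvalCase("c2", "delete account", "ask for confirmation before deleting", human_score=0.2),
]

scores = run_eval(dataset, stub_model)
mean_score = sum(scores.values()) / len(scores)
print(scores, mean_score)
```

Keeping the scorer as a plain function argument means the same dataset can be re-scored when the scoring rule itself changes.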

Experiment tracking for prompts, models, and retrieval pipelines

AI applications are rarely single-model systems. They usually combine prompts, retrievers, policies, and model variants, which means one change can alter several downstream behaviours at once. Experiment tracking lets teams compare versions side by side across the same benchmark suite, so they can attribute changes to the prompt, the model, or the retrieval layer. That matters because production AI failures often come from interaction effects rather than a single broken component. Without tracking, teams see only that quality changed, not why.

Practical implication: track prompt, model, and retrieval changes separately so regressions can be traced to the layer that introduced them.
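As a rough illustration of layer-level attribution, the sketch below records each experiment with its prompt version, model, and retriever, then reports the score change for runs that differ from the baseline in exactly one layer. The record structure, the labels such as "model-a" and "dense", and the scores are hypothetical. Runs that change several layers at once are skipped, since their delta cannot be attributed to a single layer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Experiment:
    prompt_version: str
    model: str
    retriever: str
    score: float  # mean score on the shared benchmark suite


LAYERS = ("prompt_version", "model", "retriever")


def changed_layers(baseline: Experiment, candidate: Experiment) -> list[str]:
    """Names of the layers whose value differs from the baseline."""
    return [layer for layer in LAYERS
            if getattr(baseline, layer) != getattr(candidate, layer)]


def attribute(baseline: Experiment, runs: list[Experiment]) -> dict[str, float]:
    """Map each layer to the score delta from a run that changed only that layer."""
    deltas = {}
    for run in runs:
        diff = changed_layers(baseline, run)
        if len(diff) == 1:
            deltas[diff[0]] = round(run.score - baseline.score, 3)
    return deltas


baseline = Experiment("p1", "model-a", "bm25", 0.82)
runs = [
    Experiment("p2", "model-a", "bm25", 0.85),   # prompt change only
    Experiment("p1", "model-b", "bm25", 0.81),   # model change only
    Experiment("p1", "model-a", "dense", 0.70),  # retriever change only
    Experiment("p2", "model-b", "dense", 0.74),  # all three changed: not attributable
]

deltas = attribute(baseline, runs)
worst_layer = min(deltas, key=deltas.get)
print(deltas, worst_layer)
```

In this made-up data the combined run looks like a moderate drop, while the single-layer runs show that the retriever change accounts for most of it and the prompt change was an improvement.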

NHI Mgmt Group analysis

AI evaluation is becoming the missing assurance layer for runtime decision systems. The article is not really about tooling; it is about the fact that production AI behaves like a governed control surface, not a static feature. When outputs vary by context, the assurance model has to shift from pre-release correctness to continuous evidence that behaviour stays within bounds. For identity programmes, that means the confidence gap is no longer just in the model, but in the governance process around it.

Prompt engineering now sits in the same category as other production engineering disciplines. The strongest takeaway here is that prompts are not disposable text; they are change-managed logic that can alter downstream decisions. That means versioning, testing, and release discipline belong in the workflow wherever prompts influence access, routing, or data handling. Practitioners should treat prompt change control as part of broader operational governance.

AI observability and identity governance are converging on the same assurance question. If a model-driven workflow decides what a user sees, what a bot retrieves, or which path a request follows, then its behaviour has governance impact. The relevant discipline is not whether the model is impressive, but whether the organisation can explain and verify its decision patterns over time. Teams should expect audit and control demands to rise wherever AI sits inside access-adjacent workflows.

Evaluation infrastructure is now a prerequisite for safe AI scaling. The article makes the case that teams without repeatable evals are effectively flying blind once changes leave development. That is especially true where AI is embedded in workflows with security, privacy, or entitlement consequences. The practitioner conclusion is simple: no repeatable evaluation, no defensible production AI.

From our research:
The average estimated time to remediate a leaked secret is 27 days, despite 75% of organisations expressing strong confidence in their secrets management capabilities, according to The State of Secrets in AppSec.
Only 44% of developers are reported to follow security best practices for secrets management, exposing a significant developer behaviour gap.
The evaluation gap mirrors the control gap in secrets handling, so readers should also review Ultimate Guide to NHIs and The NHI Market for broader identity context.

What this signals

Evaluation debt: teams that cannot measure AI quality changes before release will accumulate the same kind of hidden operational risk that appears when secret remediation is delayed. With a 27-day average time to remediate a leaked secret, according to The State of Secrets in AppSec, the lesson is that visible confidence is not the same as measurable control.

AI observability will increasingly be judged by whether it can produce defensible evidence for governance, not just debugging output. That matters wherever model behaviour intersects with data access, entitlement logic, or workflow approvals, because identity teams need auditable proof of decision quality over time.

As AI moves deeper into operational systems, evaluation should be treated as part of the control plane. Teams that can trace changes across prompts, models, and retrieval layers will be able to manage AI with the same discipline they apply to other governed identity workflows.

For practitioners

Build a reusable evaluation harness: Create datasets that reflect the prompts, edge cases, and failure modes most likely to affect production outcomes. Re-run the harness whenever prompts, models, or retrieval pipelines change so regressions are visible before release.
Version prompt logic like code: Store prompt text, scoring rules, and release notes in source control so changes can be reviewed, rolled back, and compared. Treat prompt edits as governed changes, not ad hoc tuning.
Separate model, prompt, and retrieval signals: Track each layer independently during experiment runs so a quality drop can be traced to the right component. This reduces false confidence when one layer improves while another silently regresses.
Extend governance reviews to AI-driven workflows: Map which AI features influence access, data exposure, approvals, or user routing, then include those paths in control testing and audit evidence. If the workflow affects decisions, it belongs in governance scope.
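One lightweight way to support the versioning practice above is to give every prompt configuration a stable fingerprint and record it alongside each evaluation result. This is a sketch under assumptions: the configuration fields, the 12-character identifier length, and the logged scores are hypothetical. It uses only the Python standard library.

```python
import hashlib
import json


def fingerprint(config: dict) -> str:
    """Stable short hash of a prompt configuration; key order does not matter."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


v1 = {
    "prompt": "Summarise the ticket in two sentences.",
    "model": "model-a",
    "scorer": "rubric-v1",
}
# Same content, different key order: should map to the same identifier.
v1_reordered = {
    "scorer": "rubric-v1",
    "model": "model-a",
    "prompt": "Summarise the ticket in two sentences.",
}
# A one-word prompt edit: should map to a different identifier.
v2 = dict(v1, prompt="Summarise the ticket in one sentence.")

# Hypothetical evaluation log keyed by configuration fingerprint.
eval_log = [
    {"config_id": fingerprint(v1), "mean_score": 0.81},
    {"config_id": fingerprint(v2), "mean_score": 0.77},
]
print(eval_log)
```

Because the identifier is derived from the content, a prompt edit that was never committed or reviewed still shows up as a new configuration in the evaluation log.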

Key takeaways

AI evaluation has moved from a development convenience to a governance requirement for production AI.
Probabilistic model behaviour demands continuous scoring and experiment tracking because exact-match testing misses real regressions.
Teams that cannot measure AI changes before release will struggle to defend access-adjacent workflows once they reach production.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Agentic AI Top 10 addresses the attack and risk surface, while NIST CSF 2.0 and NIST AI RMF set the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
NIST CSF 2.0	GV.OC-01	AI evaluation supports measurable governance and operational confidence.
NIST AI RMF		Continuous evaluation aligns with monitoring and governance of AI risk.
OWASP Agentic AI Top 10		Runtime AI behaviour needs testing against unsafe or unexpected actions.

Evaluate AI outputs and tool use for drift, misuse, and unsafe behaviour before production.

Key terms

AI Evaluation: AI evaluation is the practice of measuring whether a model-driven system still behaves as intended across real inputs and changing conditions. In production, it combines datasets, scoring, and regression checks so teams can judge quality over time rather than trusting a one-time test pass.
Observability: Observability is the ability to explain what a system did and why, using logs, traces, metrics, and comparison data. For AI systems, it extends beyond uptime and errors to include behavioural changes, output quality, and evidence that decisions remain within expected bounds.
Prompt Engineering: Prompt engineering is the disciplined design and maintenance of the instructions that shape model behaviour. In production AI, it is treated as change-controlled engineering because small prompt edits can affect output quality, retrieval patterns, and downstream business or security decisions.

Deepen your knowledge

AI evaluation and observability are core topics in our NHI Foundation Level course, the industry's only accredited NHI security programme. If your organisation is connecting AI features to governed workflows, this is a relevant place to start.

This post draws on content published by WorkOS: Ameya Bhatawdekar on building AI evaluations at Braintrust. Read the original.

NHIMG Editorial Note
Published by the NHIMG editorial team on 2026-04-15.
NHI Mgmt Group — the independent authority on Non-Human Identity, IAM, and Agentic AI security. nhimg.org