LLM evals for agent tools: why pass rates beat intuition

By NHI Mgmt Group Editorial TeamPublished 2026-03-04Domain: Agentic AI & NHIsSource: WorkOS

TL;DR: Building two eval systems for AI developer tools, one for a Claude Agent SDK CLI and one for context-loaded skills, showed why pass rates, transcript review, and domain-specific scoring mattered more than intuition or generic “helpfulness” metrics, according to WorkOS. The lesson is that non-deterministic AI outputs need outcome-based governance, not test-style expectations.

At a glance

What this is: WorkOS explains how it built practical evaluation workflows for two AI developer tools and found that statistical pass rates, not intuition, were the only reliable way to judge whether they helped developers.

Why it matters: IAM and security teams should care because the same measurement problem appears in autonomous, NHI, and human workflows whenever AI output changes each run and governance depends on proof of value rather than assumptions.

👉 Read WorkOS's full post on building practical eval workflows for AI tools

Context

LLM evaluation is a governance problem before it is a tooling problem. When an AI system produces different outputs for the same input, the question is not whether it runs, but whether it consistently produces the right outcome under real operating conditions. That matters for agentic workflows, NHI-adjacent automation, and any identity programme that depends on repeatable access decisions or bounded execution.

The article shows two different evaluation patterns: fixture-based grading for an AI-enabled CLI and A/B comparisons for prompt-loaded skills. Both are attempts to replace gut feel with measurable quality signals, which is the same shift identity teams face when assessing whether a control actually reduces risk or only looks sound in theory. For practitioners, the lesson is that reliability has to be demonstrated, not inferred.

Key questions

Q: How should security teams evaluate AI tools that behave differently on each run?

A: They should use evals, not deterministic tests. Define a realistic scenario set, score outcomes across many runs, and set pass-rate thresholds for first attempt, correction, and retry paths. The goal is to prove the system stays inside an acceptable boundary, not to force one exact output every time.

Q: Why do extra prompts or context sometimes make LLM outputs worse?

A: Extra context can distract the model from the core task, especially when the added material is accurate but tangential. The right way to judge this is through A/B testing with and without the added context, then comparing composite scores and transcript behaviour. More context is only helpful if it improves the measured outcome.

Q: What is the best way to score AI agent workflows in production-like environments?

A: Score the end state, not the reasoning trace. Check whether the resulting files, dependencies, configuration, or build output satisfy the real operating constraints of the task. That approach is stronger than string matching because it measures whether the agent produced a usable result, not just a plausible one.

Q: How do organisations know if their AI eval rubric is actually useful?

A: They should calibrate it against human review on a sample of cases and measure disagreement. If the automated scorer diverges from practitioner judgment too often, the rubric is optimizing the wrong thing. The best evals are domain-specific, explainable, and tied to the outcomes the team actually cares about.

Technical breakdown

Why non-deterministic AI systems need evals, not tests

Traditional software tests assume a stable input-to-output mapping. LLM-powered tools do not behave that way because model output varies with context, prompt phrasing, and sampling, even when the underlying task is the same. An eval therefore measures a distribution of outcomes across scenarios, not a single expected response. That is why pass rates, weighted scoring, and repeated trials matter more than exact string matches. In practice, the evaluation question becomes whether the system’s behaviour stays inside an acceptable operational boundary across many runs.

Practical implication: define statistical success thresholds before deployment instead of writing tests that assume deterministic behaviour.

Fixture-based grading for AI agent workflows

A fixture-based eval creates controlled starting states, runs the real agent, and grades the resulting project state. That approach works because agent behaviour is often expressed through side effects, file changes, dependency updates, or code structure rather than a single text response. The article’s use of git diff as source of truth is important: it turns the end state into the artefact being judged. The stronger version of this method adds domain-aware checks, such as framework-specific constraints and build validation, so technically plausible output does not pass if it is operationally broken.

Practical implication: grade agent workflows on end-state artefacts, build results, and domain constraints rather than on conversational output alone.

A/B scoring for context-loaded skills

The second eval compares the same prompt with and without extra context in the system prompt. That isolates whether the added knowledge helps or distracts the model, which is essential when teams are tempted to assume that more context always improves performance. The article’s negative-scoring example shows the danger clearly: accurate information can still degrade results if it shifts the model away from the core task. A/B evals therefore answer a different question from functional tests. They measure net value, not correctness in isolation.

Practical implication: use controlled A/B evals to prove that added context improves outcomes before you ship it into production prompting.

Moltbook AI agent keys breach — Moltbook breach exposed 1.5M AI agent keys.
AI LLM hijack breach — attackers used stolen AWS access keys to hijack Anthropic LLM models on Bedrock.

Read our 52 NHI Breaches Analysis report for a comprehensive view of breaches impacting Non-Human Identities including AI Agents.

NHI Mgmt Group analysis

Outcome-based evaluation is now part of identity governance for AI systems. When the output of a tool changes on every run, governance cannot rely on static approval or intuition. The article shows that the real control is not a test case but a measurement discipline that can separate passing behaviour from useful behaviour. That is directly relevant to autonomous workflows, where identity decisions and action quality need to be judged as outcomes, not intentions. Practitioners should treat eval design as a governance control, not an engineering afterthought.

Access decisions are only as defensible as the evidence used to score them. The article’s use of end-state grading mirrors a broader identity lesson: what matters is whether the actor left the environment in an acceptable state, not which internal steps it took. That aligns with NHI governance thinking, where secret handling, build integrity, and execution boundaries matter more than narratives about process. If the scoring model cannot explain a pass or fail with domain-specific evidence, it will eventually certify the wrong thing.

Negative feedback loops are a real failure mode in AI enablement. The article’s most important finding is that added context can make an AI system worse, not better. That is a named concept worth carrying forward: context toxicity, where accurate supporting material degrades the primary task by pulling the model toward tangential detail. For identity teams, the warning is obvious. More context, more permissions, or more orchestration does not equal more control. Practitioners should assume that additive inputs can become operational noise.

Human review still matters, but only when it is calibrated to the right signal. The article’s human-in-the-loop calibration step shows that automated scoring without human validation can produce false confidence. That matters across IAM, NHI, and agentic AI because human oversight is only useful when it checks the same thing the programme actually cares about. If reviewers are calibrating style while the system fails on safety or scope, governance is drifting. Practitioners should align review criteria to the risk outcome, not the easiest thing to inspect.

Identity teams should expect evaluation patterns to converge across humans, NHIs, and agents. The same core question appears in all three domains: does the actor behave within an acceptable operational boundary often enough to trust it? That is why NHI controls, autonomous-agent evals, and human access governance are increasingly linked. The discipline is shifting from proving that a mechanism exists to proving that it behaves consistently. Practitioners should build measurement into the control plane itself, because assurance without evidence does not scale.

From our research:
The average estimated time to remediate a leaked secret is 27 days, despite 75% of organisations expressing strong confidence in their secrets management capabilities, according to The State of Secrets in AppSec.
Only 44% of developers are reported to follow security best practices for secrets management, exposing a significant developer behaviour gap.
That gap makes a strong case for Analysis of Claude Code Security when you need to understand how AI-assisted development changes control expectations.

What this signals

Context toxicity: adding accurate background material can still reduce output quality when the model starts optimizing for the wrong signal. That is a direct warning for AI governance teams, because more permissions, more prompts, and more orchestration do not automatically improve reliability. If the programme cannot measure net lift, it cannot tell whether a control is helping or merely looking sophisticated.

The evaluation pattern described here aligns closely with the discipline behind OWASP Non-Human Identity Top 10 and NIST AI 600-1 Generative AI Profile: define the boundary, score the outcome, and validate that the added control does not create a new failure mode. For practitioners, the practical shift is toward evidence-backed change management instead of model enthusiasm.

The strongest programme signal is not whether an AI workflow can be made to work once, but whether it remains useful when compared against a baseline without the extra control. That is the same logic identity teams need when they decide whether a new governance step actually reduces risk or just adds review burden.

For practitioners

Define statistical success criteria before shipping AI tools Set first-attempt, correction, and retry thresholds for each scenario so the programme measures quality over many runs instead of chasing a perfect single-output test.
Grade end-state artefacts, not just model chatter Use build output, file diffs, imported dependencies, and framework-specific checks as the primary evidence of whether the tool behaved acceptably in production-like conditions.
Run A/B evals before adding more prompt context Compare with-context and without-context outputs on the same tasks to prove that the added material improves task performance instead of creating noise.
Calibrate automated scoring against human judgment Review a sample of scored cases against practitioner judgment and tune the rubric when disagreement shows the scorer is optimizing the wrong outcome.
Save full transcripts for every run Keep the complete interaction log so you can explain why a score moved, isolate regressions, and distinguish a real defect from a misleading metric.

Key takeaways

AI evals are about outcome quality across many runs, not exact-output correctness in one run.
Adding context to a model can reduce reliability when the context is accurate but distracts from the task.
The practical control is measurement, because trust in AI behaviour is earned through repeatable evidence.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Non-Human Identity Top 10 and OWASP Agentic AI Top 10 address the attack and risk surface, while NIST AI RMF set the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
OWASP Non-Human Identity Top 10	NHI-01	The post centers on measurable failure modes in AI-adjacent identity workflows.
OWASP Agentic AI Top 10		The article evaluates tool-using AI systems and context effects on output quality.
NIST AI RMF		AI governance needs measurable evaluation and human calibration.

Establish governance, measurement, and oversight processes that tie AI quality to documented risk outcomes.

Key terms

Evals: Evals are structured measurement systems for judging whether an AI tool performs well across many runs. They do not try to prove one exact output is correct. Instead, they use scoring, thresholds, and repeated scenarios to show whether the tool is reliably useful in practice.
A/B Evaluation: A/B evaluation compares the same prompt or workflow with and without a specific control, such as added context or a skill. It isolates the net effect of that control on output quality, which is especially useful when a model is non-deterministic and simple pass/fail tests are not enough.
Context Toxicity: Context toxicity is the failure mode where accurate supporting information makes an AI system perform worse because the extra material distracts it from the main task. In practice, the issue is not false information but misweighted information, which can lower reliability even when the added context is technically correct.
Outcome-Based Grading: Outcome-based grading judges the final state of a task rather than the steps used to reach it. For AI agents, that means assessing the resulting code, files, or configuration against task-specific criteria, because the internal reasoning path is often less useful than the observable result.

Deepen your knowledge

AI eval design and outcome-based scoring are core topics in our NHI Foundation Level course, the industry's only accredited NHI security programme. If you are building governance for AI tools that behave non-deterministically, it is worth exploring.

This post draws on content published by WorkOS: Writing my first evals, how I built a practical evaluation workflow to improve LLM reliability in real-world projects. Read the original.

NHIMG Editorial Note
Published by the NHIMG editorial team on 2026-03-04.
NHI Mgmt Group — the independent authority on Non-Human Identity, IAM, and Agentic AI security. nhimg.org