TL;DR: WorkOS built two eval systems for AI developer tools, one for a Claude Agent SDK CLI and one for context-loaded skills. According to WorkOS, the work showed that pass rates, transcript review, and domain-specific scoring mattered more than intuition or generic “helpfulness” metrics. The lesson is that non-deterministic AI outputs need outcome-based governance, not test-style expectations.
NHIMG editorial — based on content published by WorkOS: Writing my first evals, how I built a practical evaluation workflow to improve LLM reliability in real-world projects
Questions worth separating out
Q: How should security teams evaluate AI tools that behave differently on each run?
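A: They should use evals, not deterministic tests. When output varies from run to run, quality has to be measured as a pass rate over many runs rather than asserted from a single result.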
Q: Why do extra prompts or context sometimes make LLM outputs worse?
A: Extra context can distract the model from the core task, especially when the added material is accurate but tangential.
Q: What is the best way to score AI agent workflows in production-like environments?
A: Score the end state, not the reasoning trace.
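As a rough illustration of scoring the end state, the sketch below grades only what an agent run left on disk and never looks at the transcript. It is not WorkOS's grader: the function names, the `Check` type, and the two checks (a file exists, and it contains some required text) are assumptions made for this example.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Check:
    """One named pass/fail observation about the finished project."""
    name: str
    passed: bool


def grade_end_state(project_dir: Path, required_file: str, required_text: str) -> list[Check]:
    """Grade the artefacts left behind by a run, ignoring how the agent got there.

    Hypothetical checks: the expected file exists, and it contains the
    expected text (for example an import line).
    """
    target = project_dir / required_file
    exists = target.is_file()
    contains = exists and required_text in target.read_text(encoding="utf-8")
    return [
        Check("file_exists", exists),
        Check("contains_required_text", contains),
    ]


def all_passed(checks: list[Check]) -> bool:
    """A run counts as a pass only if every end-state check passed."""
    return all(check.passed for check in checks)
```

A real grader would likely add build output and framework-specific checks, but the shape stays the same: inputs are artefacts, outputs are named pass/fail results.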
Practitioner guidance
- Define statistical success criteria before shipping AI tools: Set first-attempt, correction, and retry thresholds for each scenario so the programme measures quality over many runs instead of chasing a perfect single-output test.
- Grade end-state artefacts, not just model chatter: Use build output, file diffs, imported dependencies, and framework-specific checks as the primary evidence of whether the tool behaved acceptably in production-like conditions.
- Run A/B evals before adding more prompt context: Compare with-context and without-context outputs on the same tasks to prove that the added material improves task performance instead of creating noise.
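The first and third points above can be sketched together: record a pass or fail for every run, compute a pass rate per scenario and variant, then compare that rate against a threshold and against the other variant. This is a minimal sketch, not the scoring WorkOS describes. The `RunResult` type, the variant labels, and the single threshold are assumptions; the source mentions separate first-attempt, correction, and retry thresholds, which are collapsed into one here for brevity.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """Outcome of one eval run (hypothetical record shape)."""
    scenario: str
    variant: str  # assumed labels: "with_context" or "without_context"
    passed: bool


def pass_rate(results: list[RunResult], scenario: str, variant: str) -> float:
    """Fraction of runs that passed for one scenario and variant."""
    relevant = [r for r in results if r.scenario == scenario and r.variant == variant]
    if not relevant:
        raise ValueError(f"no runs recorded for {scenario!r} / {variant!r}")
    return sum(r.passed for r in relevant) / len(relevant)


def meets_threshold(results: list[RunResult], scenario: str, variant: str, threshold: float) -> bool:
    """True when the measured pass rate reaches the agreed threshold."""
    return pass_rate(results, scenario, variant) >= threshold


def context_helps(results: list[RunResult], scenario: str, min_gain: float = 0.0) -> bool:
    """True when the with-context pass rate beats without-context by more than min_gain."""
    with_ctx = pass_rate(results, scenario, "with_context")
    without_ctx = pass_rate(results, scenario, "without_context")
    return with_ctx - without_ctx > min_gain
```

With only a handful of runs per variant, small differences in pass rate are mostly noise, so `min_gain` and the number of runs should be chosen together.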
What's in the full article
WorkOS's full blog post covers the implementation detail this post intentionally leaves for the source:
- The full fixture and grading design for the WorkOS CLI evals, including how real project states are cloned and validated.
- The complete A/B scoring approach for WorkOS Skills, including the composite rubric and hallucination penalty logic.
- The practical transcript-diff workflow used to explain why a skill scored worse with context than without it.
- The human calibration loop used to tune the rubric when automated scoring disagrees with practitioner judgment.
👉 Read WorkOS's full post on building practical eval workflows for AI tools →
LLM evals for agent tools: what should teams measure first?
Explore further
Outcome-based evaluation is now part of identity governance for AI systems. When the output of a tool changes on every run, governance cannot rely on static approval or intuition. The article shows that the real control is not a test case but a measurement discipline that can separate passing behaviour from useful behaviour. That is directly relevant to autonomous workflows, where identity decisions and action quality need to be judged as outcomes, not intentions. Practitioners should treat eval design as a governance control, not an engineering afterthought.
A few things that frame the scale:
- The average estimated time to remediate a leaked secret is 27 days, despite 75% of organisations expressing strong confidence in their secrets management capabilities, according to The State of Secrets in AppSec.
- Only 44% of developers are reported to follow security best practices for secrets management, exposing a significant developer behaviour gap.
A question worth separating out:
Q: How do organisations know if their AI eval rubric is actually useful?
A: They should calibrate it against human review on a sample of cases and measure disagreement. If the automated scorer diverges from practitioner judgment too often, the rubric is optimising for the wrong thing. The best evals are domain-specific, explainable, and tied to the outcomes the team actually cares about.
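Measuring disagreement can be as simple as comparing automated verdicts with human verdicts on the same sample. The sketch below assumes both are reduced to pass/fail labels in matching order; the function names and the 20% cut-off are illustrative assumptions, not values from the WorkOS post.

```python
def disagreement_rate(auto_labels: list[bool], human_labels: list[bool]) -> float:
    """Share of sampled cases where the automated scorer and the human reviewer differ."""
    if len(auto_labels) != len(human_labels):
        raise ValueError("label lists must cover the same cases in the same order")
    if not auto_labels:
        raise ValueError("at least one reviewed case is required")
    disagreements = sum(a != h for a, h in zip(auto_labels, human_labels))
    return disagreements / len(auto_labels)


def needs_recalibration(
    auto_labels: list[bool],
    human_labels: list[bool],
    max_disagreement: float = 0.2,  # assumed cut-off; choose one that suits the team
) -> bool:
    """True when the rubric diverges from human judgment more often than allowed."""
    return disagreement_rate(auto_labels, human_labels) > max_disagreement
```

The cases where the two labels differ are the ones worth reading in full, since they show what the rubric is rewarding or penalising by mistake.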
👉 Read our full editorial: LLM evals for agent tools: why pass rates beat intuition