
How should security teams evaluate AI tools that behave differently on each run?

By NHI Mgmt Group Editorial Team | Updated June 7, 2026 | Domain: Agentic AI & Autonomous Identity

They should use evals, not deterministic tests. Define a realistic scenario set, score outcomes across many runs, and set pass-rate thresholds for first attempt, correction, and retry paths. The goal is to prove the system stays inside an acceptable boundary, not to force one exact output every time.

Why This Matters for Security Teams

Security teams should treat run-to-run variation as a governance problem, not a testing defect. An AI tool that answers differently on each run can still be safe if its outputs stay within a defined operational boundary, but only if that boundary is measured across realistic scenarios and failure modes. That is why deterministic “same prompt, same answer” checks miss the point for agentic systems and many LLM-driven workflows. The more relevant question is whether the system behaves predictably enough under change to remain safe, authorized, and auditable, which is closer to the intent of the risk-based approach in NIST Cybersecurity Framework 2.0. For AI systems that rely on secrets, tool access, or external actions, the issue also overlaps with the NHI exposure patterns seen in the DeepSeek breach, where exposed credentials and sensitive records amplified impact. In practice, many security teams find that “randomness” is recognised as a security problem only after the system has taken an unsafe action or leaked data, not during model selection.

How It Works in Practice

Effective evaluation starts with scenario design. Security teams should build a test set that reflects the actual job the tool is expected to do, then run each scenario many times to score outcome quality, policy adherence, and harmful side effects. The goal is not a single gold answer, but an acceptable pass rate across first attempt, correction, and retry paths. For agentic systems, that means testing whether the model asks for permission at the right time, avoids out-of-scope tool calls, and respects access limits when it has to chain actions. This aligns with the risk-management approach and governance expectations of NIST Cybersecurity Framework 2.0, where controls are measured against business risk rather than exact output formatting. It also echoes the kind of exposure path highlighted in the DeepSeek breach, where leaked secrets and exposed systems turned a technical weakness into a broader security event. A practical evaluation loop usually includes:
  • Baseline prompts that represent normal user intent and expected tool use.
  • Edge-case prompts that try to induce hallucination, overreach, or policy bypass.
  • Retries and correction prompts to see whether the system recovers safely.
  • Scoring for both accuracy and safety, including disallowed actions, secret exposure, and unauthorized escalation.
  • Thresholds that define acceptable variance by task tier, not one universal pass mark.
That approach becomes stronger when paired with telemetry and review, because the same run-to-run variation that is harmless in a summarisation tool can be dangerous in an agent with write access or external side effects. These controls tend to break down when the model can call real systems with weak authorization boundaries, because an evaluation demonstrates behaviour, not containment.
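The evaluation loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: `simulated_agent`, `RunResult`, the violation labels, and the threshold values are all hypothetical names and numbers introduced here, and the random stand-in would be replaced by a call to the tool under test.

```python
import random
from dataclasses import dataclass, field

PATHS = ("first_attempt", "correction", "retry")

@dataclass
class RunResult:
    task_ok: bool                                    # did the run achieve the scenario goal?
    violations: list = field(default_factory=list)   # e.g. ["secret_exposure"]

def simulated_agent(scenario, path, rng):
    """Hypothetical stand-in for the tool under test (assumption: replace with a real call)."""
    task_ok = rng.random() < 0.9
    violations = ["out_of_scope_tool_call"] if rng.random() < 0.02 else []
    return RunResult(task_ok, violations)

def evaluate(scenarios, run_fn, runs_per_path=50, seed=0):
    """Run every scenario many times on each path and tally passes and violations."""
    if not scenarios or runs_per_path < 1:
        raise ValueError("need at least one scenario and one run per path")
    rng = random.Random(seed)
    stats = {p: {"runs": 0, "passes": 0, "violations": 0} for p in PATHS}
    for scenario in scenarios:
        for path in PATHS:
            for _ in range(runs_per_path):
                result = run_fn(scenario, path, rng)
                entry = stats[path]
                entry["runs"] += 1
                entry["violations"] += int(bool(result.violations))
                # A run passes only if it is both correct and free of violations.
                entry["passes"] += int(result.task_ok and not result.violations)
    for entry in stats.values():
        entry["pass_rate"] = entry["passes"] / entry["runs"]
    return stats

def meets_thresholds(report, min_pass_rate, max_violations=0):
    """Compare each path against its own pass-rate threshold and a violation cap."""
    return {
        path: report[path]["pass_rate"] >= min_pass_rate[path]
        and report[path]["violations"] <= max_violations
        for path in PATHS
    }

if __name__ == "__main__":
    report = evaluate(["reset a user password", "summarise an incident ticket"],
                      simulated_agent)
    # Illustrative thresholds only; real values depend on the task tier.
    thresholds = {"first_attempt": 0.85, "correction": 0.90, "retry": 0.90}
    print(meets_thresholds(report, thresholds))
```

Scoring safety separately from accuracy means a run that reaches the right answer through a disallowed action still counts as a failure, which matches the scoring bullet in the list above.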

Common Variations and Edge Cases

Tighter evaluation often increases cost and slows delivery, so organisations have to balance confidence against test volume and operational overhead. Best practice is evolving, and there is no universal standard for what pass rates should be across every AI use case. A customer-support bot, a code assistant, and an autonomous agent with tool execution authority do not warrant the same tolerance levels. For low-risk use cases, a modest variance window may be acceptable if outputs stay inside approved language and no secrets or privileged actions are involved. For higher-risk systems, the threshold should be stricter, especially where the model can trigger workflows, modify records, or access NHI-backed services.

The most common mistake is assuming that statistical consistency alone equals safety. It does not. A system can be stable and still be wrong in a repeatable way, or be variable and still remain safe within a well-defined policy boundary. That is why current guidance suggests combining evals with human review, policy checks, and audit logging rather than treating benchmark scores as a release gate on their own. Security teams should also distinguish between model variability and infrastructure weakness: if the tool is using long-lived secrets, weak RBAC, or no JIT boundary, a passing eval may still hide a serious exposure path. That distinction is especially important for autonomous agents, where the real failure often shows up in permissioning, not in the text output itself. For broader governance framing, map the programme to NIST Cybersecurity Framework 2.0 and review the incident patterns described in the DeepSeek breach.
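The trade-off between test volume and confidence, and the point that a passing eval does not cover infrastructure weaknesses, can be made concrete with a small sketch. It uses the Wilson score interval, a standard statistical bound that this guidance does not itself prescribe, to estimate a conservative lower bound on the true pass rate. The tier names, threshold values, control names, and the `readiness_check` function are hypothetical and chosen only for illustration.

```python
import math

# Illustrative thresholds per task tier (assumption: not taken from any standard).
TIER_THRESHOLDS = {"low": 0.90, "medium": 0.95, "high": 0.99}

def wilson_lower_bound(passes, runs, z=1.96):
    """Lower end of the Wilson score interval for a pass rate (z=1.96 is about 95%)."""
    if runs <= 0 or not 0 <= passes <= runs:
        raise ValueError("need runs > 0 and 0 <= passes <= runs")
    p = passes / runs
    denom = 1 + z * z / runs
    centre = p + z * z / (2 * runs)
    margin = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs))
    return (centre - margin) / denom

def readiness_check(tier, passes, runs, controls):
    """Combine the eval result with infrastructure controls; either can block."""
    reasons = []
    lower = wilson_lower_bound(passes, runs)
    if lower < TIER_THRESHOLDS[tier]:
        reasons.append(
            f"pass-rate lower bound {lower:.3f} is below the {tier}-tier threshold"
        )
    # Controls are checked independently of the eval score.
    reasons += [f"control not satisfied: {name}"
                for name, ok in controls.items() if not ok]
    return (not reasons, reasons)

if __name__ == "__main__":
    controls = {"short_lived_secrets": True, "scoped_rbac": True, "jit_access": False}
    print(readiness_check("high", passes=100, runs=100, controls=controls))
```

Under these illustrative numbers, 100 passes out of 100 runs gives a lower bound of roughly 0.963, which clears the low and medium tiers but not the 0.99 high tier, so a stricter tier needs more runs as well as a higher observed pass rate. A failed control blocks readiness regardless of how good the eval score is.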

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Agentic AI Top 10 and CSA MAESTRO address the attack and risk surface, while NIST AI RMF sets the governance and control requirements practitioners need to meet.

Framework | Control / Reference | Relevance
OWASP Agentic AI Top 10 | | Agentic systems need evals that test unsafe tool use and policy bypass, not exact text matches.
CSA MAESTRO | | MAESTRO covers governance for autonomous AI behaviour and runtime decision-making.
NIST AI RMF | | AI RMF frames risk-based evaluation of unpredictable model behaviour.

Define runtime guardrails and measure whether agent outcomes stay within approved operational bounds.
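As a rough illustration of that recommendation, the sketch below checks each proposed tool call against an approved boundary and records the decision, so the share of in-bounds actions can be measured over time. The `Guardrail` class, its fields, and the tool names are hypothetical; a real deployment would enforce the boundary in the authorization layer rather than in application code alone.

```python
from dataclasses import dataclass, field

@dataclass
class Guardrail:
    allowed_tools: frozenset            # tools the agent is approved to call
    max_records_modified: int = 0       # 0 means read-only by default
    audit_log: list = field(default_factory=list)

    def check(self, tool, records_modified=0):
        """Return True if the proposed call is inside the approved boundary."""
        allowed = (tool in self.allowed_tools
                   and records_modified <= self.max_records_modified)
        # Every decision is logged, allowed or not, to keep the run auditable.
        self.audit_log.append(
            {"tool": tool, "records_modified": records_modified, "allowed": allowed}
        )
        return allowed

    def in_bounds_rate(self):
        """Share of proposed calls that stayed inside the boundary."""
        if not self.audit_log:
            raise ValueError("no calls recorded yet")
        return sum(e["allowed"] for e in self.audit_log) / len(self.audit_log)

if __name__ == "__main__":
    guard = Guardrail(frozenset({"search_tickets", "read_ticket"}))
    for tool in ("read_ticket", "delete_ticket", "search_tickets"):
        print(tool, guard.check(tool))
    print(guard.in_bounds_rate())
```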

NHIMG Editorial Note
Reviewed and updated by the NHIMG editorial team on June 7, 2026.