How do organisations decide whether agentic red teaming is actually working?

Organisations should judge agentic red teaming by coverage of runtime paths, not by the number of prompts tested. If the programme can map verified tools, permissions, data flows, and downstream actions into reproducible exploit chains, it is working. If it only produces response-level failures, it is still testing the wrong surface.

Why This Matters for Security Teams

Agentic red teaming is only useful when it measures whether autonomous systems can be pushed into unsafe actions, not whether they can be tricked into saying the wrong thing. That distinction matters because agents chain tools, reuse context, and operate with delegated authority. A prompt-only test can miss the real failure mode entirely: an innocuous instruction that leads to data access, workflow abuse, or privilege escalation. Both the OWASP Top 10 for Agentic Applications 2026 and the NIST AI Risk Management Framework point toward outcome-based evaluation, but there is no universal standard for scoring agentic red teaming yet.

NHI Management Group’s research on the OWASP NHI Top 10 shows why this is difficult in practice: once an agent can reach tools and secrets, the test surface becomes runtime behaviour, not model output. Security teams need evidence that a red team can reproduce exploit chains across identity, permissions, and downstream actions, then show that controls block, contain, or detect them. In practice, many security teams discover the weakness only after an agent has already touched production data or executed an unauthorised action, rather than through intentional validation.

How It Works in Practice

Effective measurement starts by defining the agent’s runtime attack surface: tools, credentials, data sources, tool-call sequence, approval gates, and the actions that can follow each step. Red teamers should then build scenarios that attempt to move from harmless input to harmful outcome, such as exfiltrating a secret, modifying a ticket, invoking a payment flow, or chaining multiple tools to reach a restricted dataset. The best programmes score whether the exploit chain was reproducible, whether the control failed open or closed, and whether the failure was observed, blocked, or remediated.
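The three scoring questions above can be captured in a small record per exploit chain. The following is a minimal sketch, not a standard scoring scheme: the names `ChainResult`, `ControlOutcome`, `Response`, and the 0.5 reproducibility threshold are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class ControlOutcome(Enum):
    FAILED_OPEN = "failed_open"      # the harmful action went through
    FAILED_CLOSED = "failed_closed"  # the control stopped the action


class Response(Enum):
    UNOBSERVED = "unobserved"
    OBSERVED = "observed"
    BLOCKED = "blocked"
    REMEDIATED = "remediated"


@dataclass
class ChainResult:
    name: str
    attempts: int            # how many times the chain was replayed
    successes: int           # replays that reached the harmful outcome
    control: ControlOutcome
    response: Response

    @property
    def reproducibility(self) -> float:
        """Fraction of replays that reached the harmful outcome."""
        return self.successes / self.attempts if self.attempts else 0.0

    def is_finding(self, threshold: float = 0.5) -> bool:
        # Hypothetical rule: a chain is a finding when it reproduces at or
        # above the threshold and the control failed open.
        return (
            self.reproducibility >= threshold
            and self.control is ControlOutcome.FAILED_OPEN
        )


result = ChainResult(
    name="ticket-comment-to-secret-read",
    attempts=10,
    successes=8,
    control=ControlOutcome.FAILED_OPEN,
    response=Response.OBSERVED,
)
```

Keeping reproducibility, control outcome, and defender response as separate fields means a chain that was observed but never blocked still shows up as a finding rather than being averaged into a single score.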

That approach aligns with the practical direction in the CSA MAESTRO agentic AI threat modeling framework and the MITRE ATLAS adversarial AI threat matrix, both of which emphasise adversarial paths, not isolated model failures. A practical test harness usually includes:

  • Verified tool inventory and permission scope for each agent
  • Annotated data-flow mapping from input to downstream action
  • Repeatable exploit chains with clear success criteria
  • Telemetry for tool calls, policy decisions, and revocation events
  • Recovery checks that confirm access is removed or constrained after detection
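As an illustration of how the inventory and exploit chains in this list can feed a coverage figure, the sketch below compares the runtime paths in a verified inventory with the paths that at least one exploit chain exercised. `RuntimePath` and `path_coverage` are hypothetical names, and treating a path as an (agent, tool, permission, downstream action) tuple is an assumption about how a team might model it.

```python
from typing import NamedTuple


class RuntimePath(NamedTuple):
    agent: str
    tool: str
    permission: str
    downstream_action: str


def path_coverage(
    inventory: set[RuntimePath], exercised: set[RuntimePath]
) -> tuple[float, set[RuntimePath]]:
    """Return (fraction of inventoried paths exercised, untested paths).

    Paths that were exercised but are missing from the inventory are
    ignored here; in practice they would signal an incomplete inventory.
    """
    if not inventory:
        return 0.0, set()
    untested = inventory - exercised
    covered = len(inventory) - len(untested)
    return covered / len(inventory), untested


inventory = {
    RuntimePath("support-bot", "tickets", "read", "summarise ticket"),
    RuntimePath("support-bot", "tickets", "write", "modify ticket"),
    RuntimePath("support-bot", "vault", "read", "fetch secret"),
    RuntimePath("billing-bot", "payments", "invoke", "start payment flow"),
}
exercised = {p for p in inventory if p.tool != "payments"}

coverage, untested = path_coverage(inventory, exercised)
```

The untested set is usually more useful than the percentage, because it names the specific tool and permission combinations no scenario has reached yet.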

For organisations using the research patterns in NHIMG’s AI LLM hijack breach coverage, a meaningful result is not that the agent produced unsafe text, but that the red team could demonstrate how a prompt became a concrete operational breach. The evaluation should also verify whether policies are enforced at request time, not just documented on paper. Policy controls tend to break down in environments where agents have broad tool access, weak approval boundaries, and multiple hidden connectors, because the exploit path moves faster than human review can keep up.
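One way to check request-time enforcement is to route every tool call in a test chain through a gate that decides per call and records the decision. This is a simplified sketch under stated assumptions: `PolicyGate`, `PolicyDenied`, and `run_chain` are hypothetical names, and a real deployment would sit in front of actual connectors rather than in-process handlers.

```python
from dataclasses import dataclass, field


class PolicyDenied(Exception):
    """Raised when a tool call falls outside the agent's granted scope."""


@dataclass
class PolicyGate:
    # agent name -> set of (tool, action) pairs the agent may invoke
    scopes: dict[str, set[tuple[str, str]]]
    # telemetry: (agent, tool, action, "allow" | "deny") per request
    decisions: list[tuple[str, str, str, str]] = field(default_factory=list)

    def call(self, agent: str, tool: str, action: str, handler):
        # The decision is made when the call is requested, not ahead of time.
        allowed = (tool, action) in self.scopes.get(agent, set())
        self.decisions.append(
            (agent, tool, action, "allow" if allowed else "deny")
        )
        if not allowed:
            raise PolicyDenied(f"{agent} may not {action} via {tool}")
        return handler()


def run_chain(gate: PolicyGate, agent: str, steps):
    """Replay steps in order; return the index of the first denied step.

    Returns None if every step was allowed, meaning the chain completed.
    """
    for index, (tool, action, handler) in enumerate(steps):
        try:
            gate.call(agent, tool, action, handler)
        except PolicyDenied:
            return index
    return None


executed = []
gate = PolicyGate(scopes={"support-bot": {("tickets", "read")}})
chain = [
    ("tickets", "read", lambda: executed.append("tickets.read")),
    ("vault", "read", lambda: executed.append("vault.read")),
]
blocked_at = run_chain(gate, "support-bot", chain)
```

In this example the chain stops at the second step, the vault handler never runs, and the deny decision is recorded, which is the kind of evidence that distinguishes an enforced policy from a documented one.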

Common Variations and Edge Cases

Tighter red-team scoring often increases operational overhead, requiring organisations to balance test depth against engineering time, model churn, and production risk. That tradeoff matters because a fully realistic agentic test can disrupt workflows if it exercises live connectors or high-value credentials.

Best practice is evolving for multi-agent systems, delegated browsers, and long-running copilots. In these environments, a single prompt may not matter unless it changes agent state across several steps, so teams should evaluate persistence, memory poisoning, tool chaining, and escalation paths separately. Some organisations use sandboxed replicas; others prefer staged production controls with hardened guardrails. There is no universal standard for this yet, but the consensus is moving toward runtime evidence, not benchmark theatre.
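Evaluating these techniques separately can be as simple as keeping one tally per category, so a clean result for tool chaining does not hide a weak one for memory poisoning. The sketch below is illustrative only: the category labels mirror the paragraph above, while `summarise` and the (category, breached) result shape are assumptions.

```python
CATEGORIES = ("persistence", "memory_poisoning", "tool_chaining", "escalation")


def summarise(results):
    """Tally runs and breaches per category.

    results: iterable of (category, breached) pairs, where breached is True
    when the scenario reached its harmful outcome.
    """
    tally = {c: {"runs": 0, "breaches": 0} for c in CATEGORIES}
    for category, breached in results:
        if category not in tally:
            raise ValueError(f"unknown category: {category}")
        tally[category]["runs"] += 1
        tally[category]["breaches"] += int(breached)
    return tally


report = summarise([
    ("tool_chaining", False),
    ("tool_chaining", False),
    ("memory_poisoning", True),
    ("persistence", False),
])
```

A category with zero runs, such as escalation in this example, is itself a result worth reporting, since it marks a path nobody has tested.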

One useful indicator is whether the red team can force the agent to violate intent without triggering an obvious model refusal. Another is whether defenders can prove that least privilege, short-lived credentials, and policy enforcement actually limited blast radius. The research piece "LLMjacking: How Attackers Hijack AI Using Compromised NHIs" is a reminder that identity abuse often appears faster than teams expect, which is why static test cases are rarely enough. In edge cases such as offline agents, shared service identities, or indirect prompt injection through third-party content, organisations should treat red teaming as continuous assurance rather than a one-time pass-or-fail event.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Agentic AI Top 10 and CSA MAESTRO address the attack and risk surface, while NIST AI RMF sets the governance and control requirements practitioners need to meet.

Framework | Control / Reference | Relevance
OWASP Agentic AI Top 10 | A2 | Assesses agent abuse paths and unsafe tool-driven behaviour.
CSA MAESTRO | T1 | Focuses on threat modeling of agent actions and attack paths.
NIST AI RMF | | Supports evaluation of AI risks through measurable, documented testing.

Map tests to agent abuse paths and prove controls stop harmful tool use, not just bad outputs.