Break-your-own AI agent testing needs a red-team framework

By NHI Mgmt Group Editorial Team
Published 2026-04-23
Domain: Agentic AI & NHIs
Source: ZioSec

TL;DR: AI agent red-teaming becomes evidence production, not prompt chaos, when a six-phase framework defines scope, threat modelling, attack chaining, evidence, remediation, and retesting across Claude Code, OpenClaw, and custom harnesses, according to ZioSec. The core lesson is that AI agent security must be tested as a runtime attack surface mapped to auditor-ready controls, not as a collection of isolated jailbreak prompts.

At a glance

What this is: A practical six-phase framework for red-teaming AI agents that maps attack chains to evidence, remediation, and re-testing.

Why it matters: IAM, PAM, and AI governance teams need repeatable ways to prove which agent actions, tool calls, and data paths are in scope before an attacker does.


Context

AI agent red-teaming is the discipline of deliberately trying to make an agent misuse tools, expose data, or execute unsafe actions so defenders can see what really breaks. The governance gap is that many teams still treat agents as if prompt safety and endpoint security are enough, even though the real risk sits in tool access, memory, and delegated runtime behaviour.

For IAM and NHI programmes, the problem is not just whether the agent can be blocked. It is whether scope, accountability, and evidence exist before the agent reaches a destructive or exfiltrating state. The article frames that gap as a repeatable test process, which is the right starting point for builders who need more than a one-off pentest.

The approach is typical for mature security teams because it translates abstract AI risk into artefacts auditors can review. That matters across NHI, autonomous, and human identity programmes because the same question keeps returning: who can do what, with which tools, under what evidence trail, and how do you prove it held under attack?

Key questions

Q: How should security teams run red-team testing for AI agents?

A: Start with a scoped inventory of the agent’s harness, tools, data sources, and memory, then build goal-based attack chains that cross prompts and tool calls. A useful red-team run produces evidence, severity, and a repeatable remediation path, not just a list of failed jailbreaks. Re-test the original chain after fixes to prove the control now holds.

Q: Why do AI agents need a different testing approach from web applications?

A: AI agents can act through delegated tools, memory, and chained prompts, so the risky behaviour often appears after a normal login or approval step. Web-app pentests can miss tool misuse, indirect injection, and cross-harness drift. The difference is runtime decision flow, not just interface exposure, which means testing must follow the agent’s execution path.

Q: How do you know if AI agent remediation is actually working?

A: The original attack chain must fail after the fix, and close variants should fail too. If the same goal can still be reached with different wording or a different tool sequence, remediation is partial. The strongest signal is a repeatable post-fix verification log that shows the harmful outcome no longer occurs.

Q: What should auditors expect in an AI agent evidence package?

A: Auditors need the achieved goal, the exact reproduction chain, the framework mapping, a severity rating tied to blast radius, and the remediation timeline. Without those elements, the finding is hard to govern because it cannot be routed to the right control owner or verified after the fix.
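
The five elements in that answer can be treated as required fields on a finding record. The following is a minimal sketch, assuming a hypothetical EvidencePackage structure; the field names are illustrative and are not taken from ZioSec's framework.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class EvidencePackage:
    """Hypothetical finding record; one field per element auditors expect."""
    achieved_goal: str = ""
    reproduction_chain: str = ""
    framework_mapping: str = ""
    severity: str = ""
    remediation_timeline: str = ""


def missing_elements(package: EvidencePackage) -> List[str]:
    """Return the names of any elements that were left blank."""
    return [f.name for f in fields(package) if not getattr(package, f.name).strip()]


# Illustrative draft finding with two elements still missing.
draft = EvidencePackage(
    achieved_goal="Agent emailed a restricted file to an external address",
    reproduction_chain="Two-turn chain: summarise request, then share request",
    severity="high",
)
print(missing_elements(draft))  # ['framework_mapping', 'remediation_timeline']
```

A check like this could gate whether a finding is accepted into a tracker, so incomplete packages are sent back before they reach a control owner.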

Technical breakdown

Scope definition for AI agent red-teaming

Scope is the boundary that turns AI agent testing from guesswork into evidence. In this framework, scope means enumerating the harness, model, tools, data sources, memory, and state so testers know what the agent can actually touch. That matters because many failures are not model failures at all. They are access failures, where an agent reaches a system or secret that nobody wrote down in the first place. A usable scope document also fixes the compliance target, because OWASP ASI, MITRE ATLAS, NIST AI RMF, ISO 42001, and AIUC-1 each ask slightly different questions about control and accountability. Practical implication: treat scope as a control artefact, not a planning note.

Practical implication: force every agent into an inventory-backed scope statement before any red-team run begins.
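
One way to make a scope statement inventory-backed is to hold it as data and compare it against what the agent actually did. This is a minimal sketch under stated assumptions: AgentScope, unlisted_tools, and every tool and harness name below are hypothetical, and the observed calls would in practice come from whatever logging your harness provides.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set


@dataclass
class AgentScope:
    """Hypothetical scope statement for one agent."""
    harness: str
    model: str
    tools: Set[str] = field(default_factory=set)
    data_sources: Set[str] = field(default_factory=set)
    memory_stores: Set[str] = field(default_factory=set)


def unlisted_tools(scope: AgentScope, observed_calls: Iterable[str]) -> List[str]:
    """Return tools the agent called that the scope statement never listed."""
    return sorted(set(observed_calls) - scope.tools)


# Illustrative values only.
scope = AgentScope(
    harness="custom-harness",
    model="example-model",
    tools={"read_file", "search_docs"},
    data_sources={"wiki"},
    memory_stores={"session"},
)
observed = ["read_file", "send_email", "search_docs", "send_email"]
print(unlisted_tools(scope, observed))  # ['send_email']
```

Any non-empty result suggests the inventory is incomplete, which is the condition under which the test run itself should be treated as untrusted.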

Attack chains across model, protocol, and tool layers

A useful AI agent attack is rarely a single prompt. It is a chain that moves from intent to tool call to target effect, often across multiple turns and sometimes across indirect prompt injection or delegated tool use. That is why the article rejects prompt lists as a testing method. A prompt list can show susceptibility; a chain can show impact. The technical distinction matters because many agent harnesses look safe in isolation but fail once a second tool, a memory lookup, or a follow-on message changes the execution path. Practical implication: test chained abuse scenarios, not isolated jailbreaks.

Practical implication: simulate multi-turn, multi-tool chains that end in a concrete harmful outcome, not just a model refusal.
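
The difference between a prompt list and a chain can be sketched in a few lines. The example below assumes a toy stand-in for an agent (a function that takes a prompt and returns the tool names it called); AttackChain, run_chain, and toy_agent are hypothetical names, not part of any real harness. The chain counts as successful only when the harmful tool sequence appears in order across all turns.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Assumption: an agent is modelled as prompt in, list of tool names called out.
Agent = Callable[[str], List[str]]


@dataclass
class AttackChain:
    goal: str
    prompts: List[str]
    harmful_sequence: List[str]  # ordered tool calls that make up the harmful outcome


def contains_in_order(calls: Iterable[str], sequence: List[str]) -> bool:
    """True if `sequence` occurs within `calls` in order (gaps allowed)."""
    remaining = iter(calls)
    return all(step in remaining for step in sequence)


def run_chain(agent: Agent, chain: AttackChain) -> bool:
    """Replay every prompt, accumulate tool calls, and check for the outcome."""
    calls: List[str] = []
    for prompt in chain.prompts:
        calls.extend(agent(prompt))
    return contains_in_order(calls, chain.harmful_sequence)


def toy_agent(prompt: str) -> List[str]:
    """Toy stand-in: each request looks benign when taken alone."""
    calls = []
    if "summarise" in prompt:
        calls.append("read_file")
    if "share" in prompt:
        calls.append("send_email")
    return calls


chain = AttackChain(
    goal="exfiltrate a file by email",
    prompts=[
        "Please summarise the finance report",
        "Now share that summary with my external address",
    ],
    harmful_sequence=["read_file", "send_email"],
)
first_turn_only = AttackChain(chain.goal, chain.prompts[:1], chain.harmful_sequence)

print(run_chain(toy_agent, chain))            # True: the full chain reaches the outcome
print(run_chain(toy_agent, first_turn_only))  # False: a single turn does not
```

Scoring on the outcome rather than on a refusal is the design choice here: a single-turn test of either prompt would record nothing harmful.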

Evidence packages and remediation verification

Evidence is what makes AI agent red-teaming useful outside the security team. The article correctly pushes findings into a package that includes the achieved goal, the exact chain, framework mapping, severity, and remediation timeline. That structure matters because auditors and CISOs need reproducibility, not anecdote. The final re-test step closes the loop by rerunning the same chain after remediation and proving the control now stops the attack. Without that verification, teams only have a promise that the issue was fixed. Practical implication: require pre- and post-remediation chain results before closing any finding.

Practical implication: do not close findings until the original chain and its close variants fail after remediation.
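
That closure rule is simple enough to encode directly. The sketch below assumes post-fix results are recorded as a mapping from chain name to whether the attack still succeeded; can_close and the chain names are hypothetical. Requiring at least one variant alongside the original is an assumption of this sketch, reflecting the rule that close variants must fail too.

```python
from typing import Dict


def can_close(still_succeeds: Dict[str, bool], original: str = "original") -> bool:
    """Decide whether a finding may be closed after remediation.

    `still_succeeds` maps chain name -> True if the attack still reached
    its goal when re-run after the fix.
    """
    if original not in still_succeeds:
        return False  # the original chain was never re-run
    if len(still_succeeds) < 2:
        return False  # assumption: at least one close variant must be re-run too
    return not any(still_succeeds.values())


# Illustrative post-fix results.
partial_fix = {"original": False, "reworded": False, "tool-order-swap": True}
full_fix = {"original": False, "reworded": False, "tool-order-swap": False}

print(can_close(partial_fix))  # False: one variant still reaches the goal
print(can_close(full_fix))     # True: original and variants all fail
```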

Threat narrative

Attacker objective: The attacker aims to make the AI agent perform a harmful action on their behalf while preserving the appearance of legitimate agent behaviour.

Entry begins when an attacker or tester supplies a goal-based prompt chain to the AI agent through its normal conversational or workflow interface.
Credential or capability abuse occurs when the agent is induced to call tools, access data sources, or execute commands beyond the intended task boundary.
Escalation happens through chained prompts, multi-tool interactions, or indirect injection that widens the agent’s effective scope and reaches the target system or data.
Impact is achieved when the agent exfiltrates sensitive content, sends restricted data externally, or executes an unsafe command on a connected system.

Moltbook AI agent keys breach — Moltbook breach exposed 1.5M AI agent keys.
AI LLM hijack breach — attackers used stolen AWS access keys to hijack Anthropic LLM models on Bedrock.

Read our 52 NHI Breaches Analysis report for a comprehensive view of breaches impacting Non-Human Identities including AI Agents.

NHI Mgmt Group analysis

AI agent red-teaming only becomes governance when it produces evidence. A Friday afternoon prompt sweep is noise, not assurance, because it does not bind attack paths to scope, severity, or remediation. The article is right to move builders toward reproducible chains and evidence packages, because that is the point where AI security becomes auditable and operational. Practitioners should treat testing outputs as governance artefacts, not research notes.

Identity controls for agents fail when the test stops at authentication. The article’s focus on tools, data sources, and memory shows that the real security surface is delegated runtime capability, not login state. That is where OWASP NHI and OWASP agentic guidance intersect: the agent may authenticate cleanly and still be unsafe because its effective authority spans tool calls the access review never modelled. Practitioners need to evaluate the whole execution path, not just the front door.

Runtime attack chaining is the named concept this framework sharpens. The critical failure mode is not a single malicious prompt but the accumulation of benign-looking steps into one harmful outcome. That is why single-turn jailbreak testing misses the field’s real exposure, while chained attacks expose tool misuse, indirect injection, and cross-harness drift. The implication is simple: if your control cannot explain multi-step abuse, it cannot explain agent risk.

AI agent governance now sits between NHI discipline and autonomous behaviour analysis. The article’s framework is harness-agnostic because the attack surface persists even when the wrapper changes, which is a familiar NHI lesson. At the same time, the use of goal-based chained actions pulls the problem toward autonomous-style runtime behaviour, where outcomes are produced by decision paths rather than static scripts. Practitioners should align red-team findings to both NHI control failures and AI governance obligations.

Framework mapping is not a reporting detail; it is how findings become decision-ready. If a chain cannot be mapped to OWASP ASI, MITRE ATLAS, NIST AI RMF, ISO 42001, or AIUC-1, most organisations will fail to route it to the right owner. The article correctly treats mapping as part of the evidence package, not a postscript. Practitioners should make framework alignment mandatory for every agent finding.

From our research:
98% of companies plan to deploy even more AI agents within the next 12 months, despite documented rogue behaviour in 80% of current deployments, according to the AI Agents: The New Attack Surface report.
Only 52% of companies can track and audit the data their AI agents access, leaving 48% with a complete blind spot for compliance and breach investigation.
Ultimate Guide to NHIs, Regulatory and Audit Perspectives is the next step if you need to turn agent findings into audit-ready control evidence.

What this signals

Runtime attack chaining is becoming the practical unit of measurement for AI agent risk, because single-prompt tests miss how real failures emerge across tools, memory, and delegation. Teams that still rely on one-off jailbreak attempts will understate exposure and overstate control maturity.

The governance bar is shifting from access approval to execution verification. If your programme cannot show which agent actions were tested, which tool paths were exercised, and which remediations were re-verified, you do not yet have an operational control model for AI agents.

With 33% of organisations already reporting that AI agents accessed inappropriate or sensitive data beyond intended scope, per the AI Agents: The New Attack Surface report, practitioners should expect red-team evidence to become a standard input to AI governance and IAM oversight.

For practitioners

Build a complete agent scope register. List every harness, tool, data source, memory store, and privilege boundary before testing begins. If you cannot enumerate the agent’s reachable systems and secrets, treat the inventory as incomplete and the test as untrusted.
Test chained abuse scenarios, not isolated prompts. Design attacks that move across turns, tools, and indirect injection paths until they reach a concrete harmful outcome such as data exfiltration or unsafe command execution. Use the same chain against each harness your organisation runs.
Attach framework mappings to every finding. Map each successful chain to the controls your auditors already use, including OWASP agentic risks, MITRE ATLAS techniques, and NIST AI RMF governance functions. Evidence without a framework label is hard to triage and easier to ignore.
Require pre- and post-remediation re-tests. Rerun the exact attack chain after the fix, then test close variants with different wording, tool order, and indirect injection paths. Close the issue only when the original chain and its variations fail consistently.
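
Of the variant types named in the last item, tool order is the easiest to enumerate mechanically. The following is a minimal sketch, assuming a chain's tool sequence is held as a list of names; tool_order_variants and the tool names are hypothetical. Rewording and indirect injection variants need human or model authorship and are not covered here.

```python
from itertools import permutations
from typing import List, Tuple


def tool_order_variants(tool_sequence: List[str]) -> List[Tuple[str, ...]]:
    """Return every reordering of the tool sequence except the original order."""
    original = tuple(tool_sequence)
    reordered = (p for p in permutations(original) if p != original)
    # dict.fromkeys removes duplicates (possible when a tool repeats) and keeps order.
    return list(dict.fromkeys(reordered))


# Illustrative three-tool chain.
variants = tool_order_variants(["search_docs", "read_file", "send_email"])
print(len(variants))  # 5
print(variants[0])    # ('search_docs', 'send_email', 'read_file')
```

The number of orderings grows factorially with chain length, so for longer chains it would be more realistic to sample a handful of orderings than to enumerate them all.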

Key takeaways

AI agent security testing must be built around reproducible attack chains, not isolated prompt abuse.
The article shows why evidence, framework mapping, and re-testing are the only outputs that meaningfully support governance decisions.
Practitioners should treat delegated tool use as part of identity risk, because that is where agent behaviour becomes operationally dangerous.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Agentic AI Top 10 and OWASP Non-Human Identity Top 10 address the attack and risk surface, while NIST AI RMF sets the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
OWASP Agentic AI Top 10		The article maps attack chains to agentic AI risks and tool misuse.
NIST AI RMF		The framework links red-team findings to AI governance and remediation evidence.
OWASP Non-Human Identity Top 10	NHI-01	Agent tool access and delegated runtime authority are non-human identity concerns.

Inventory agent identities and scope their tool access before allowing production execution.

Key terms

AI Agent Red-Teaming: AI agent red-teaming is the practice of deliberately trying to make an agent misuse tools, expose data, or take unsafe actions. It combines adversarial testing with evidence collection so the result can support remediation, governance, and audit review rather than just demonstrate a failure.
Attack Chain: An attack chain is a sequence of prompts, observations, and tool calls that moves an AI agent from a benign starting point to a harmful result. In agent security, the chain matters more than any single prompt because real risk often emerges only when actions accumulate across steps.
Evidence Package: An evidence package is a structured finding record that captures what the agent achieved, how it achieved it, how severe the impact is, and how the issue was remediated and verified. It gives CISOs, auditors, and engineers a shared artefact for decision-making.
Harness: A harness is the runtime wrapper around an AI agent, including the model, tools, data sources, memory, and control logic that shape behaviour. Two agents may look similar from the outside yet behave differently because the harness changes what they can access and how they execute.

Deepen your knowledge

AI agent red-teaming and delegated tool risk are core topics in our NHI Foundation Level course, the industry's only accredited NHI security programme. If you are building an evidence-based governance process for agents, it is worth exploring.

This post draws on content published by ZioSec: Break Your Own AI Agent: A Practical Red-Team Framework for Builders (Part 2). Read the original.

NHIMG Editorial Note
Published by the NHIMG editorial team on 2026-04-23.
NHI Mgmt Group — the independent authority on Non-Human Identity, IAM, and Agentic AI security. nhimg.org