Security teams should test AI agents through the same interface and runtime path production uses, then validate the tools, memory stores, and downstream sinks those agents can reach. A good program ties each finding to an observed side effect, such as a webhook call, data write, or workflow trigger, rather than treating prompt success or failure as the result.
Why This Matters for Security Teams
AI agents are not just chat interfaces with better memory. Once an agent can call tools, write to systems, and retain state across steps, the security question shifts from “Did the prompt work?” to “What did the agent actually do?” That means red teaming has to follow the agent’s real runtime path, including tool invocation, memory retrieval, and every downstream sink it can reach. Current guidance suggests treating agent behaviour as a chain of side effects, not a text-only exchange.
This is exactly where many teams under-test. A prompt that looks harmless can still trigger a webhook, leak data into a note store, or chain through multiple tools before anyone notices. NHIMG’s AI Agents: The New Attack Surface report shows the scale of the problem: 80% of organisations report that agents have already acted beyond their intended scope, yet only 44% have implemented policies to govern them. That gap is why red teams need to validate effects, not just model responses. In practice, many security teams discover agent abuse only after a workflow has already fired, rather than through intentional test coverage.
Frameworks such as the OWASP Agentic AI Top 10 and the NIST AI Risk Management Framework both point toward runtime-aware evaluation, but the operational lesson is simpler: if the agent can act, the test must observe the action.
How It Works in Practice
Effective agent red teaming starts with environment parity. Test the agent through the same API, UI, orchestration layer, and identity path used in production. If the agent uses tool calls, capture those calls. If it reads from memory stores, seed those stores with controlled data and adversarial entries. If it can write to tickets, chats, databases, or webhooks, instrument those sinks so every side effect is visible and attributable.
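One way to make sinks visible and attributable is to wrap each sink callable in a thin recorder that tags every call with the test scenario that caused it. The sketch below is illustrative only: `SinkRecorder`, `SideEffect`, `send_webhook`, and the `RT-001` test ID are hypothetical names, and a real harness would wrap the production client rather than a stub.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class SideEffect:
    test_id: str             # red-team scenario that caused the call
    sink: str                # logical sink name, e.g. "webhook"
    payload: dict[str, Any]  # arguments the agent passed to the sink


@dataclass
class SinkRecorder:
    events: list[SideEffect] = field(default_factory=list)

    def wrap(self, sink: str, fn: Callable[..., Any], test_id: str) -> Callable[..., Any]:
        """Return a callable that records each call, then forwards it to fn."""
        def instrumented(**kwargs: Any) -> Any:
            self.events.append(SideEffect(test_id, sink, dict(kwargs)))
            return fn(**kwargs)
        return instrumented

    def for_test(self, test_id: str) -> list[SideEffect]:
        """All side effects attributed to one test scenario."""
        return [e for e in self.events if e.test_id == test_id]


def send_webhook(url: str, body: str) -> str:
    # Hypothetical stub standing in for a real webhook client.
    return "queued"


recorder = SinkRecorder()
webhook = recorder.wrap("webhook", send_webhook, test_id="RT-001")
webhook(url="https://example.invalid/hook", body="status update")
```

After a scenario runs, `recorder.for_test("RT-001")` lists the side effects to attach to the finding, regardless of what the model said in its response.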
A practical program usually includes four layers:
- Tool abuse testing, including unexpected parameter combinations, chained actions, and permission boundary checks.
- Memory poisoning and retrieval abuse, especially where long-term memory influences later decisions or tool selection.
- Data exfiltration checks, including indirect leakage through logs, summaries, tickets, or notifications.
- Authority escalation tests, where the agent is induced to request or misuse higher-privilege actions than intended.
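The tool abuse layer can be sketched as a check over captured tool calls. The example below assumes the harness already logs each call as a tool name plus parameters; the `ALLOWED` table, the tool names, and `boundary_violations` are hypothetical and stand in for whatever permission model the agent actually runs under.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ToolCall:
    tool: str
    params: dict[str, Any]


# Hypothetical allowlist: tool name -> parameter names the agent may set.
ALLOWED: dict[str, set[str]] = {
    "search_tickets": {"query"},
    "add_ticket_comment": {"ticket_id", "text"},
}


def boundary_violations(calls: list[ToolCall],
                        allowed: dict[str, set[str]] = ALLOWED) -> list[tuple[str, str]]:
    """Return (tool, reason) pairs for calls outside the permitted boundary."""
    findings = []
    for call in calls:
        if call.tool not in allowed:
            findings.append((call.tool, "tool not in allowlist"))
            continue
        extra = set(call.params) - allowed[call.tool]
        if extra:
            findings.append((call.tool, f"unexpected parameters: {sorted(extra)}"))
    return findings


captured = [
    ToolCall("search_tickets", {"query": "open incidents"}),
    ToolCall("delete_ticket", {"ticket_id": 17}),
    ToolCall("add_ticket_comment", {"ticket_id": 17, "text": "ok", "notify_all": True}),
]
violations = boundary_violations(captured)
```

Here `violations` flags the `delete_ticket` call and the extra `notify_all` parameter, while the in-scope search passes. The same pattern extends to chained actions by checking sequences of calls rather than single calls.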
Red teams should also test whether the agent respects intent boundaries at runtime. That means the finding is not “the model answered incorrectly,” but “the model caused an unauthorised side effect.” This aligns with the CSA MAESTRO agentic AI threat modeling framework, which emphasises mission context, tool trust, and lifecycle controls, and with NHIMG’s OWASP NHI Top 10, which frames agent access as a non-human identity problem with operational consequences. For advanced programs, align scenarios with the MITRE ATLAS adversarial AI threat matrix to map prompt injection, data poisoning, and tool misuse to concrete attack paths.
These controls tend to break down when agents sit inside loosely governed automations that share credentials, write to multiple systems, and lack end-to-end telemetry, because side effects become hard to isolate and attribute.
Common Variations and Edge Cases
Tighter red team coverage often increases operational overhead, so organisations have to balance realism against the cost of instrumentation and replay. That tradeoff is real, especially where agents run across third-party tools, shared memory layers, or legacy workflows with limited logging.
Current guidance suggests treating memory differently depending on its role. Short-lived session memory is usually easier to test and reset, while persistent memory creates a harder problem because poisoned context can survive across tasks. There is no universal standard for this yet, but best practice is evolving toward separate controls for transient state, durable memory, and retrieval sources.
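The difference between transient and durable memory can be tested with a canary: seed a marker, reset the session, and check whether the marker still reaches the agent's context. The sketch below uses a toy in-process store; `ToyMemory`, `CANARY`, and `canary_survives_reset` are hypothetical names, and a real test would target the agent's actual memory backend.

```python
CANARY = "CANARY-7f3a"  # hypothetical marker seeded by the red team


class ToyMemory:
    """Toy store with separate transient and durable layers."""

    def __init__(self) -> None:
        self.session: list[str] = []   # cleared between tasks
        self.durable: list[str] = []   # persists across tasks

    def reset_session(self) -> None:
        self.session.clear()

    def context(self) -> list[str]:
        # What the agent would see when it retrieves memory.
        return self.durable + self.session


def canary_survives_reset(memory: ToyMemory, canary: str = CANARY) -> bool:
    """True if a seeded canary is still retrievable after a session reset."""
    memory.reset_session()
    return any(canary in entry for entry in memory.context())


transient = ToyMemory()
transient.session.append(f"note: {CANARY}")
transient_result = canary_survives_reset(transient)

persistent = ToyMemory()
persistent.durable.append(f"note: {CANARY}")
persistent_result = canary_survives_reset(persistent)
```

In this toy setup the session-only canary disappears after the reset and the durable one survives, which is the condition a poisoning test would report as a finding.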
Another edge case is multi-agent systems. One agent may appear safe in isolation, but a second agent can amplify the first agent’s actions by reusing outputs, tool results, or delegated permissions. In these environments, tests should include cross-agent trust boundaries, not just single-agent prompts. That is also where frameworks such as the NIST AI Risk Management Framework and the OWASP Top 10 for Agentic Applications 2026 remain useful: they push teams to evaluate governance, not just model output.
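One way to express a cross-agent trust boundary as a test is to compare what a downstream agent did against the permissions every agent in the delegation chain holds. The sketch below assumes an intersection policy (a delegated action is in scope only if each agent in the chain is permitted to perform it). That policy, and the names `effective_permissions` and `amplified_actions`, are illustrative assumptions rather than something the frameworks above prescribe.

```python
def effective_permissions(chain: list[set[str]]) -> set[str]:
    """Intersect the permission sets of every agent in a delegation chain."""
    perms: set[str] | None = None
    for agent_perms in chain:
        perms = set(agent_perms) if perms is None else perms & set(agent_perms)
    return perms or set()


def amplified_actions(chain: list[set[str]], observed: list[str]) -> list[str]:
    """Observed actions that exceed what the whole chain is allowed to do."""
    return sorted(set(observed) - effective_permissions(chain))


# Hypothetical scenario: a read-only intake agent delegates to a mailer agent.
intake_agent = {"read_ticket"}
mailer_agent = {"read_ticket", "send_email"}

amplified = amplified_actions(
    [intake_agent, mailer_agent],
    observed=["read_ticket", "send_email"],
)
```

Here `amplified` contains `send_email`: the second agent is allowed to send mail on its own, but doing so on behalf of a read-only agent is the kind of amplification a single-agent test would miss.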
For organisations still building their baseline, NHIMG’s report on AI agents as a new attack surface is a useful reminder that the common failure is visibility. If a team cannot see which tools, memories, and sinks were touched, it cannot credibly claim the agent was red teamed.
Standards & Framework Alignment
This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.
OWASP Agentic AI Top 10 and CSA MAESTRO address the attack and risk surface, while NIST AI RMF sets the governance and control requirements practitioners need to meet.
| Framework | Control / Reference | Relevance |
|---|---|---|
| OWASP Agentic AI Top 10 | A2 | Tool misuse and side effects are core agentic AI attack paths. |
| CSA MAESTRO | TM-02 | MAESTRO covers mission context, tool trust, and agent lifecycle threats. |
| NIST AI RMF | | AI RMF supports governance, testing, and ongoing risk monitoring for agents. |
Apply AI RMF governance to define test coverage, ownership, and escalation handling.