Why do AI agents and frontier models complicate traditional security testing?

Why This Matters for Security Teams

AI agents and frontier models change the testing problem because they are not single-shot systems. They can plan, remember, call tools, and react to new context, which means a prompt that looks safe in isolation can still lead to risky behaviour later in the session. That is why guidance from the OWASP Agentic AI Top 10 and the NIST AI Risk Management Framework increasingly focuses on runtime behaviour, not only static inputs.

The practical impact is straightforward: traditional appsec testing tends to validate one request, one response, and one permission boundary. Agentic systems can chain actions across multiple prompts, retrieve fresh context, and expose new failure modes when the model is stressed, redirected, or given partial instructions. NHIMG research on OWASP NHI Top 10 also shows why identity and secret handling matter in these workflows, because the attack surface often expands after the model starts acting. In practice, many security teams encounter unsafe agent behaviour only after a real workflow has already combined prompts, tools, and credentials.

How It Works in Practice

Effective testing for AI agents has to evaluate the session, not just the prompt. That means observing how the model behaves across multiple turns, whether it follows unsafe tool calls, whether it can be induced to leak context, and whether it respects boundaries when the task evolves. Current guidance suggests treating the agent as a dynamic workload with a runtime identity, then testing the full chain: input handling, policy checks, tool invocation, and post-action state.

Practitioners are increasingly combining behavioural red-teaming with policy checks grounded in CSA MAESTRO agentic AI threat modeling framework and the MITRE ATLAS adversarial AI threat matrix. The important shift is from “did the model answer safely?” to “did the system stay safe while the model planned, retrieved, delegated, and executed?” That often means:

testing multi-turn prompt chains instead of one-off prompts

verifying tool permissions at runtime, not only at deployment

checking whether the agent can escalate by chaining benign actions

reviewing logs for context leakage, hidden state, and unexpected side effects

using policy-as-code and real-time approval gates for higher-risk actions

NHIMG’s Analysis of Claude Code Security is a useful reference point because it shows why code-focused agent workflows need runtime controls as much as model safety checks. These controls tend to break down when the agent is allowed broad tool access in production and its behaviour depends on unpredictable external data.

Common Variations and Edge Cases

Tighter agent testing often increases latency, engineering effort, and false positives, so organisations must balance coverage against operational friction. Best practice is evolving, and there is no universal standard for how much autonomy is safe to test with automation alone.

Some environments are harder than others. A customer-support agent with read-only tools is different from a developer agent that can write code, open tickets, and trigger deployments. Frontier models also complicate evaluation because small changes in context, prompt structure, or tool output can produce materially different outcomes. That makes regression testing important, but not sufficient. Teams should also validate revocation behaviour, session expiry, and least-privilege boundaries, especially where secrets or long-lived tokens are involved. NHIMG research on the State of Non-Human Identity Security shows how often organisations still struggle with visibility and rotation, which becomes even more urgent when an agent can act continuously. The hardest cases are multi-agent systems and long-running workflows, because failure can emerge only after several apparently harmless steps have already compounded.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Agentic AI Top 10 and CSA MAESTRO address the attack and risk surface, while NIST AI RMF set the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
OWASP Agentic AI Top 10	Agentic Top 10	Addresses runtime risks from multi-step agent behaviour and tool misuse.
CSA MAESTRO	Threat modeling	Covers agent-specific attack paths across planning, tools, and autonomy.
NIST AI RMF		Frames AI risk management around governance, measurement, and monitoring.

Use AI RMF to govern testing, monitor behaviour, and measure residual model risk.

#1 Authority in NHI Education, Research and Advisory, empowering organizations to tackle the critical risks posed by Non-Human Identities (NHIs), including AI Agents.

Why do AI agents and frontier models complicate traditional security testing?

Why This Matters for Security Teams

How It Works in Practice

Common Variations and Edge Cases

Standards & Framework Alignment

Related resources from NHI Mgmt Group