Autonomous red teaming exposes the LLM security benchmark gap

By NHI Mgmt Group Editorial TeamPublished 2026-05-14Domain: Agentic AI & NHIsSource: Lasso Security

TL;DR: Autonomous red teaming is being positioned as a way to stress-test LLMs and agentic systems before attackers do, while OWASP updates on excessive agency, system prompt leakage, and RAG weaknesses show why traditional red teaming leaves blind spots, according to Lasso Security and OWASP. The real issue is not just model testing, but the lack of clear ownership and benchmarks for systems that now change behaviour, expose prompts, and pull external data at runtime.

At a glance

What this is: Lasso Security argues that autonomous red teaming addresses the growing gap between LLM deployment speed and the security testing models enterprises still rely on.

Why it matters: IAM and security teams need to treat LLMs as governed identity-bearing systems because autonomy, prompt exposure, and retrieval access all change how access, testing, and accountability have to work.

👉 Read Lasso Security's analysis of autonomous red teaming for LLM security

Context

Autonomous red teaming is a way to test LLMs by simulating attacks continuously instead of waiting for manual security reviews. The governance gap is simple: enterprises are deploying models and agentic workflows faster than they are assigning ownership, setting baselines, or defining what “secure enough” means for runtime behaviour.

That matters for identity because LLMs increasingly sit behind credentials, access paths, and delegated tool use. Once a model can reveal prompts, draw on retrieval sources, or trigger downstream actions, the security problem is no longer just model quality. It becomes identity, privilege, and control scope across the full execution path.

For teams building programmes around agentic AI, the relevant question is not whether the model is clever enough to fool an attacker. It is whether current IAM, PAM, and NHI governance can define and test the boundaries of what the model is allowed to see, use, and influence in production.

Key questions

Q: How should security teams test LLMs that can access tools and external data?

A: Security teams should test LLMs by simulating real runtime abuse, not only prompt injection. That means validating how the model handles hidden instructions, retrieval poisoning, tool calls, and unsafe output escalation. The right question is whether the system can be constrained when context changes, because that is where the operational risk emerges.

Q: Why do traditional red team exercises miss so many AI security issues?

A: Traditional red team exercises often miss AI security issues because they assume fixed logic, predictable change windows, and static attack surfaces. LLMs can alter behavior through prompts, retrieved data, and connected tools during normal use. That makes the real control problem continuous runtime governance, not only one-time testing.

Q: When does agentic AI become a governance problem rather than a model-quality problem?

A: Agentic AI becomes a governance problem when the system can select actions, influence downstream workflows, or access data beyond a narrow prompt-response cycle. At that point, the risk is not only what the model says, but what it is able to do with that access in production.

Q: What should organisations do first when securing LLMs and AI agents?

A: Organisations should start by defining ownership, permitted actions, and the boundaries of model access. Before advanced controls, they need to know who is accountable for prompts, retrieval sources, and tool permissions. Without that baseline, testing and monitoring cannot be tied to a clear security decision.

Technical breakdown

Why traditional red teaming struggles with LLM and agentic AI systems

Traditional red teaming assumes a bounded target, a stable attack surface, and a test cycle that can be planned around change windows. LLM deployments break all three assumptions. Models are updated frequently, used across multiple clouds and applications, and exposed to dynamic inputs that shift the risk profile during normal operation. That makes static testing incomplete, especially when the system is connected to external data, tools, or downstream workflows. The security issue is not that red teaming is obsolete. It is that the object being tested now behaves more like a living service than a fixed application.

Practical implication: security teams need testing coverage that tracks model changes, tool connections, and retrieval paths continuously rather than on a quarterly cadence.

Excessive agency and why model autonomy changes the risk model

Excessive agency describes a model or agent that can take actions with real effects, not just produce text. In practice, that means the security boundary must account for decisions, tool calls, and execution side effects, not only prompt content. Once a model can choose actions or trigger workflows, the issue becomes who authorised the action, under what scope, and whether the system can be constrained before it acts. This is why agentic AI creates a different governance problem from ordinary software automation. The control model has to include runtime behaviour, not only deployment settings.

Practical implication: map every agent action path to an explicit approval, scope, or containment control before the system reaches production.

System prompt leakage and RAG weaknesses as identity-adjacent failure modes

System prompt leakage matters because the prompt often contains operating rules, role boundaries, and hidden instructions that shape model behaviour. If an attacker exposes that content, they gain a blueprint for bypassing safeguards or steering responses. Retrieval-augmented generation adds another risk layer because external knowledge sources can be manipulated, poisoned, or overexposed if access controls and data validation are weak. These are not abstract model problems. They are governance failures at the boundary between identity, content, and data access, where the system’s trust decisions become visible to attackers.

Practical implication: treat prompt material and retrieval sources as governed assets, with access control, integrity checks, and monitoring tied to the model runtime.

Threat narrative

Attacker objective: The attacker aims to turn a trusted AI system into a controllable execution or disclosure layer that leaks sensitive information and widens the blast radius of downstream access.

Entry begins when an attacker targets exposed prompt content, retrieval sources, or connected model inputs that the LLM trusts during normal operation.
Escalation occurs when the attacker uses that trusted context to manipulate outputs, trigger inappropriate tool use, or extract hidden operating rules that widen access.
Impact follows when the compromised model behavior causes data leakage, unsafe actions, or downstream workflow abuse across the enterprise environment.

Moltbook AI agent keys breach — Moltbook breach exposed 1.5M AI agent keys.
AI LLM hijack breach — attackers used stolen AWS access keys to hijack Anthropic LLM models on Bedrock.

Read our 52 NHI Breaches Analysis report for a comprehensive view of breaches impacting Non-Human Identities including AI Agents.

NHI Mgmt Group analysis

Autonomous red teaming is becoming the only credible way to test LLMs that change behavior at runtime. Manual red teaming still matters, but it was built for systems whose boundaries could be enumerated and retested on a schedule. LLMs with dynamic retrieval, prompt-driven behaviour, and tool access alter those boundaries continuously. The implication is that security assurance for AI systems now has to be continuous, not episodic.

Excessive agency is the right named concept for the failure mode this category exposes. The problem is not simply that an AI system is powerful, but that it can take actions whose effects outlive the prompting event. That shifts governance away from content review and toward runtime control of action scope, execution authority, and downstream side effects. Practitioners should read this as a model of action risk, not just model output risk.

System prompt leakage is no longer a niche data-loss issue. It is a control-plane exposure. When hidden instructions reveal how a model is governed, attackers gain a map of the decision logic they need to bypass. That is especially relevant when LLMs are connected to business data or internal tools. The practitioner conclusion is that prompts, retrieval rules, and agent instructions must be treated as governed identity-adjacent assets.

The strongest signal here is not the feature itself but the gap between LLM deployment speed and governance maturity. Enterprises are moving faster than they can define benchmarks, ownership, and acceptable behaviour for model testing. That creates an assurance deficit that traditional application security cannot close on its own. Security leaders need to treat AI assurance as a standing operating model, not a one-time project.

AI security and identity security are converging around runtime trust boundaries. Once models can read sensitive context, select tools, and influence downstream actions, the question becomes who or what is authorised to do what at runtime. That is the same governance problem identity teams already manage for privileged humans and NHIs, only now the actor can alter its own path through the system. Practitioners should align AI testing with identity governance, not leave it as a separate discipline.

From our research:
98% of companies plan to deploy even more AI agents within the next 12 months, despite documented rogue behaviour in 80% of current deployments, according to AI Agents: The New Attack Surface report.
Only 52% of companies can track and audit the data their AI agents access, leaving 48% with a complete blind spot for compliance and breach investigation.
That gap makes continuous runtime testing the practical next step, and OWASP NHI Top 10 is the right framework to anchor it.

What this signals

Autonomous red teaming will become a baseline control for AI governance rather than a specialist add-on. As LLMs move closer to business workflows, the relevant control question shifts from whether a model was tested once to whether its behavior is being revalidated as prompts, retrieval sources, and tool connections change. Security teams should prepare for AI assurance to sit alongside identity governance and application control, not outside them.

Excessive agency is the concept practitioners should watch most closely. Once an AI system can influence execution instead of merely generating content, the organisation has created a runtime trust boundary that must be monitored like any other high-risk identity path. That means model owners, security operations, and IAM teams need shared visibility into actions, approvals, and changes.

At the scale problem level, the SailPoint AI Agents: The New Attack Surface report shows 80% rogue behaviour in current deployments and 52% audit visibility, which is enough to make governance a present-tense issue. The programme response is to anchor testing, monitoring, and ownership in the same operational model rather than split them across separate AI and IAM teams.

For practitioners

Define model ownership before deployment Assign a named owner for each LLM or agentic workflow, including responsibility for prompts, retrieval sources, tool access, and test cadence. Governance fails fastest when no team owns the runtime boundary.
Test runtime behaviour, not just output quality Build red-team scenarios that probe prompt leakage, retrieval poisoning, tool misuse, and unsafe autonomous actions. Include scenarios where the model can alter the chain of events after its initial response.
Treat prompts and retrieval sources as governed assets Restrict access to system prompts, retrieval indices, and agent instructions with the same care applied to privileged configuration and secrets. Track who can modify them and who can read them.
Map model actions to explicit guardrails For every action path, define whether the model can observe, suggest, request approval, or execute. If the boundary is unclear, the model has more authority in practice than the programme intends.

Key takeaways

LLM security now depends on runtime governance because model behavior can change when prompts, retrieval, and tool access change.
Traditional red teaming leaves blind spots when it cannot continuously test the ways AI systems act, not just what they say.
The control gap is ownership and accountability for model actions, prompts, and data access, which should be defined before deployment.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Agentic AI Top 10 and OWASP Non-Human Identity Top 10 address the attack and risk surface, while NIST AI RMF set the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
OWASP Agentic AI Top 10	NHI-01	Agentic autonomy and tool misuse are central to the red-teaming problem described here.
OWASP Non-Human Identity Top 10	NHI-03	The post covers access, prompt, and retrieval governance for non-human AI identities.
NIST AI RMF		AI governance, measurement, and accountability apply directly to autonomous red teaming.

Treat prompts, retrieval indexes, and credentials as governed NHI assets with reviewable access.

Key terms

Autonomous Red Teaming: A continuous testing approach that simulates adversarial behavior against AI systems without relying on manual, one-off exercises. It matters because LLMs change through prompts, retrieval, and tool access, so security testing has to reflect runtime behavior rather than a fixed application state.
Excessive Agency: A failure mode where an AI system can take actions with real operational impact beyond what the governance model expected. In practice, it means the model is not just producing content. It is influencing execution, which demands controls for scope, approvals, and downstream effects.
System Prompt Leakage: The exposure of hidden instructions that govern how an LLM should behave, respond, or constrain itself. These prompts often carry the model’s operational logic, so leakage can reveal guardrails, privileged context, and the decision rules an attacker needs to bypass.
Retrieval-Augmented Generation: A design pattern where an LLM pulls in external knowledge at runtime to improve relevance and freshness. The security challenge is that retrieval sources become part of the trust boundary, so access control, integrity, and monitoring must cover the data the model reads as well as what it outputs.

Deepen your knowledge

Autonomous red teaming and LLM security governance are core topics in our NHI Foundation Level course, the industry's only accredited NHI security programme. If you are working through AI model ownership and runtime access control, it is worth exploring.

This post draws on content published by Lasso Security: Strengthening LLM Security from the Get-Go, autonomous red teaming in action. Read the original.

NHIMG Editorial Note
Published by the NHIMG editorial team on 2026-05-14.
NHI Mgmt Group — the independent authority on Non-Human Identity, IAM, and Agentic AI security. nhimg.org