AI red teaming exposes where GenAI guardrails fail in practice

By NHI Mgmt Group Editorial TeamPublished 2026-03-04Domain: Agentic AI & NHIsSource: Lasso Security

TL;DR: AI red teaming shifts security testing from static infrastructure to the model, prompt, plugin, and agent layers, uncovering prompt injection, data leakage, and broken access control risks in real-world GenAI use, according to Lasso Security. The underlying issue is that traditional security assumes stable system behaviour, while AI behaviour changes with context, inputs, and chained tool calls.

At a glance

What this is: AI red teaming is adversarial testing for GenAI systems, and the article argues it is needed because model behaviour, prompts, and integrations create risks that traditional penetration testing misses.

Why it matters: It matters to IAM practitioners because AI systems increasingly rely on scoped access, plugin calls, and delegated workflows that must be governed like identities, not just tested like software.

👉 Read Lasso Security's guide to AI red teaming types, components, and best practices

Context

AI red teaming is the practice of simulating adversarial pressure against GenAI systems to expose weaknesses before attackers do. In identity terms, the concern is not just whether the model answers safely, but whether prompts, plugins, APIs, and autonomous workflows can be driven beyond their intended authority.

That shift matters because AI risk now includes data leakage, over-privileged tool use, and control bypass across the full interaction chain. For IAM and NHI teams, the governance question is no longer limited to access at login or deployment time, but to what the model is allowed to do at runtime and under what conditions.

Key questions

Q: What breaks when AI red teaming is not part of GenAI governance?

A: Without AI red teaming, organisations usually discover failures only after a model has already exposed data, bypassed a guardrail, or triggered an unsafe API call. The main breakdown is false confidence: teams assume standard application testing covers the model, but the real risk sits in prompt handling, tool delegation, and context-sensitive behaviour.

Q: Why do GenAI systems complicate identity and access control?

A: GenAI systems complicate identity and access control because they can turn a single user request into a chain of delegated actions across plugins, APIs, and service accounts. Access decisions are no longer tied only to the user at login. They must also account for what the model can do at runtime and what authority it inherits downstream.

Q: How do security teams know if AI red teaming is working?

A: AI red teaming is working when testing finds real prompt injection paths, over-scoped integrations, and policy gaps before attackers do, and when fixes are re-tested successfully after model or workflow changes. The strongest signal is repeatable reduction in exposed authority, not a lower number of red-team findings on its own.

Q: Should organisations treat AI plugins like privileged access?

A: Yes. AI plugins and connected APIs can act on behalf of the model, so they should be governed as privileged access paths with narrow scopes, explicit approval boundaries, and continuous review. If a plugin can reach customer records or production systems, it belongs in the same governance conversation as PAM and NHI controls.

Technical breakdown

Prompt injection and adversarial inputs in LLM workflows

Prompt injection occurs when hidden or adversarial instructions alter a model’s behaviour, often through user prompts, uploaded files, or retrieved content. In GenAI systems, the model may treat untrusted input as if it were a command, which makes policy boundaries far weaker than they look on paper. Indirect prompt injection is especially dangerous because the payload can ride in documents or API responses and be executed during normal workflow processing. The security failure is not just content toxicity. It is the collapse of trust boundaries between instruction, data, and execution context.

Practical implication: separate trusted system instructions from untrusted content and test every retrieval and upload path for instruction smuggling.

Broken access control in AI plugins and agent workflows

AI plugins and agent workflows extend model capability into external systems, but each connection creates a new authorisation surface. The model may be able to request an action, yet the downstream API or plugin decides whether that action is allowed. If scopes are too broad or tokens are reused across steps, a harmless prompt can become a privileged transaction. This is why red teaming must examine chained integrations, not just model outputs. The real control boundary is often the API token, delegated scope, or service account behind the agent, not the chatbot front end.

Practical implication: review delegated scopes, token reuse, and API permissions for every AI-connected workflow before allowing production use.

Continuous red teaming for autonomous AI systems

The article’s emphasis on autonomous red-teaming points to a basic operational problem: AI risks evolve too quickly for annual tests. Continuous red teaming uses repeated adversarial simulations to catch prompt bypasses, hallucination-driven misuse, and privilege abuse as models change. That approach is closer to ongoing control validation than a one-time assessment. It also aligns with NIST AI RMF thinking, where governance, mapping, and measurement must be continuous when model behaviour is probabilistic and context-sensitive. Static test results age quickly in these environments.

Practical implication: treat AI security testing as a recurring control-validation cycle, not a pre-launch checklist.

Moltbook AI agent keys breach — Moltbook breach exposed 1.5M AI agent keys.
AI LLM hijack breach — attackers used stolen AWS access keys to hijack Anthropic LLM models on Bedrock.

Read our 52 NHI Breaches Analysis report for a comprehensive view of breaches impacting Non-Human Identities including AI Agents.

NHI Mgmt Group analysis

AI red teaming is really identity testing for delegated machine behaviour. The article describes a world where models, copilots, plugins, and agent workflows can trigger actions beyond the intent of the original prompt. That is not just application security, because the useful unit of analysis is the delegated authority chain behind the model. Practitioners should read this as a governance signal that AI systems are now exercising identity-like behaviour in production.

Prompt injection exposes a trust boundary problem, not just a content-filter problem. Hidden instructions in files, messages, and retrieved content succeed when systems fail to distinguish instruction from data. That makes the governance issue structural, because the model is being asked to evaluate untrusted material inside the same execution context as trusted policy. For practitioners, the control question is where instruction authority actually lives.

Broken access control in AI workflows is often really over-scoped delegation. When plugins or downstream APIs accept broad tokens, the model inherits permissions that no human operator would be granted in a single task. This is especially relevant to OWASP-NHI thinking, because the risky object is usually the service identity or token behind the agent. Practitioners should treat the agent stack as an NHI governance problem with AI-specific failure modes.

Continuous red teaming is the only realistic response to probabilistic behaviour. Annual testing assumes the control surface is stable, but GenAI systems change with prompts, model updates, data, and tools. That makes one-off assessments inadequate for environments where the same input can produce different outcomes under slightly different context. For practitioners, the lesson is that AI assurance must become an operating control, not a project milestone.

Policy-oriented red teaming creates the bridge between AI safety and IAM governance. The article’s compliance scenarios show that model misuse can create privacy, fairness, and access-control failures at the same time. That is where NHIMG’s cross-domain view matters: the enterprise does not get separate risk registers for model behaviour, identity delegation, and policy enforcement. Practitioners should align testing with the actual authority the system can exercise.

From our research:
85% of organisations lack full visibility into third-party vendors connected via OAuth apps, according to The State of Non-Human Identity Security.
Lack of credential rotation is cited as the top cause of NHI-related attacks by 45% of organisations, followed by inadequate monitoring and logging at 37% and over-privileged accounts at 37%.
That visibility gap matters here because red teaming only works when teams can see which identities, tokens, and delegated tools the model can actually reach, as shown in 52 NHI Breaches Analysis.

What this signals

Delegated authority chain: AI security teams should treat every model-to-tool connection as an identity boundary, not just an integration. If the workflow can reach records, send messages, or call APIs, then red team findings should map directly to permissions, not only to model behaviour.

The practical shift is toward continuous control validation, because static test reports age quickly in GenAI environments. When prompts, connectors, and model versions change, the test surface changes with them, so operational teams need a standing process for retesting and evidence capture.

With 85% of organisations lacking full visibility into third-party vendors connected via OAuth apps, per The State of Non-Human Identity Security, AI programmes that depend on external tools are already operating in a visibility gap. That gap is where agent workflows become difficult to govern and even harder to audit.

For practitioners

Map every AI workflow to its delegated authority chain. Document which prompts, plugins, APIs, service accounts, and downstream systems can be reached from each GenAI application, then identify where authority expands beyond the original use case. Pay special attention to over-permissioned scopes and shared credentials.
Test for prompt injection across all untrusted inputs. Run adversarial tests against uploaded files, retrieved documents, chat inputs, and API responses to verify whether hidden instructions can override system behaviour. Include encoded, obfuscated, and multilingual payloads in the test set.
Review plugin and API permissions as identity controls. Treat AI-connected integrations as privileged access paths and inspect token reuse, scope breadth, and call chaining. If an agent can trigger a backend action, the downstream permission model needs the same scrutiny as PAM or NHI access.
Move AI assurance into continuous validation. Schedule recurring red-team exercises and re-run tests after model updates, prompt changes, connector additions, or policy changes. Track findings in a way that shows whether guardrails still hold under live adversarial pressure.

Key takeaways

AI red teaming exposes the gap between model safety assumptions and the reality of delegated machine action.
The biggest risks sit in prompt injection, over-scoped plugins, and chained API calls, not in model output alone.
Enterprises need continuous validation because GenAI risk changes whenever the prompt, model, or connected tool changes.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Agentic AI Top 10 and OWASP Non-Human Identity Top 10 address the attack and risk surface, while NIST AI RMF set the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
OWASP Agentic AI Top 10		Covers prompt injection, tool misuse, and agent workflow abuse in AI systems.
NIST AI RMF		Addresses governance and measurement for probabilistic AI behaviour.
OWASP Non-Human Identity Top 10	NHI-03	Over-scoped tokens and delegated access are core NHI risks in AI workflows.

Test agent workflows for prompt injection and tool abuse before allowing production delegation.

Key terms

AI Red Teaming: AI red teaming is adversarial testing for models, prompts, and connected workflows. The goal is to expose unsafe behaviour, data leakage, and access-control failures before attackers exploit them. In GenAI environments, the test surface includes the model, its instructions, and every tool it can call.
Prompt Injection: Prompt injection is the use of hidden or adversarial instructions to steer a model away from its intended behaviour. It can appear in user input, files, web content, or API responses. The failure is a broken trust boundary between data that should be read and instructions that should be obeyed.
Delegated Authority Chain: A delegated authority chain is the sequence of identities, tokens, plugins, and APIs that an AI system can use to act on behalf of a request. It matters because the model may not hold the permissions itself, but it can still trigger actions through downstream credentials and scoped access.
Guardrail: A guardrail is a policy, filter, or control designed to keep an AI system within acceptable behavioural limits. In practice, guardrails must be tested against adversarial inputs and runtime tool use, because controls that look sound in design can fail when context changes during execution.

Deepen your knowledge

AI red teaming, prompt injection testing, and delegated access review are core topics in our NHI Foundation Level course, the industry's only accredited NHI security programme. If you are building governance around GenAI workflows and connected identities, it is worth exploring.

This post draws on content published by Lasso Security: What is Red Teaming in AI? Types, Components & Best Practices. Read the original.

NHIMG Editorial Note
Published by the NHIMG editorial team on 2026-03-04.
NHI Mgmt Group — the independent authority on Non-Human Identity, IAM, and Agentic AI security. nhimg.org