AI red teaming is exposing gaps in model and agent governance

By NHI Mgmt Group Editorial TeamPublished 2025-07-17Domain: Agentic AI & NHIsSource: TROJ.AI

TL;DR: AI red teaming simulates adversarial prompts, jailbreaks, data extraction, and model evasion to expose failures in models, applications, and agents before production exposure, according to TROJ.AI. The governance issue is broader than testing quality: security teams need continuous, lifecycle-aware controls that account for changing model behaviour and agentic risk.

At a glance

What this is: AI red teaming is a structured way to test AI models, applications, and agents for prompt injection, jailbreaks, leakage, and unsafe behaviour before production use.

Why it matters: It matters because AI systems are now making decisions that affect access, fraud, and operations, which means IAM, NHI, and governance teams need controls that fit probabilistic behaviour, not just static software.

👉 Read TROJ.AI's AI red teaming guide for models, applications, and agents

Context

AI red teaming is the practice of deliberately attacking AI models, applications, and agents to find failure modes before those systems are relied on in production. In identity terms, the problem is not just model quality, but whether the organisation can govern runtime behaviour that changes with prompts, data, and context.

Traditional application security assumes the attack surface is relatively stable and that failures can be fixed by patching code or tuning policy. AI systems are different because the risky behaviour can emerge from interaction, not just from software defects, which is why AI security teams need testing methods that match the way these systems actually behave.

For identity leaders, the real question is where current governance stops working once AI begins making decisions that affect fraud, supply chain, customer access, or operational routing. That is why red teaming sits at the intersection of AI risk, NHI governance, and access control design.

Key questions

Q: How should security teams red team AI systems that can use tools?

A: Security teams should test the full runtime path, not just the model’s text output. That means adversarial prompts, tool-call abuse, approval bypass attempts, and downstream side effects on APIs, workflows, and data stores. If the AI can act, the red team must verify what it can reach, what it can change, and where human review still applies.

Q: Why do AI systems require different security testing than traditional software?

A: AI systems can fail through interaction, retrieval, and probabilistic behaviour rather than only through code defects. A model may respond differently to hidden prompts, external content, or changing context, which makes static scanning insufficient. Security teams need adversarial testing because risk can emerge after deployment, not only during development.

Q: What breaks when AI red teaming is treated as a one-time exercise?

A: A one-time test misses behavioural drift, new integrations, changing prompts, and expanding tool access. That creates a false sense of assurance because the system that passed yesterday may behave differently tomorrow. Continuous validation is necessary whenever the model, its data, or its permissions change.

Q: Who should own governance for AI models and agents that affect access decisions?

A: Ownership should sit with the teams that govern risk, identity, and security outcomes together, not with model development alone. When an AI system influences access, fraud, or workflow execution, IAM, PAM, and AI security stakeholders need a shared control model with clear accountability for approval boundaries and lifecycle change.

Technical breakdown

Prompt injection and jailbreak testing in AI systems

Prompt injection attempts to override or steer a model by placing malicious instructions inside user input, retrieved content, or tool data. Jailbreaking is the broader effort to bypass guardrails so the system produces disallowed outputs or takes unintended actions. These tests matter because many AI applications mix natural language with hidden system prompts, policy rules, and tool calls, which creates multiple layers of trust in a single execution path. Red teams use targeted prompts and scenario chaining to see whether the model obeys the attacker, the policy, or the intended workflow.

Practical implication: test every AI workflow that consumes external text or tool output for instruction-following failures before it reaches users.

Data extraction, leakage, and model behaviour drift

Data extraction tests whether a model can be pushed into revealing sensitive training data, prompt content, or user information. Leakage often appears when the model has memorised patterns from source material or when retrieval and logging paths expose data indirectly. Behaviour drift is the operational risk that a model becomes less predictable over time as prompts, fine-tuning, or connected data sources change. That means a one-time approval is not enough. Security teams need to treat model behaviour as something that can change after deployment, not only before release.

Practical implication: re-test models after prompt, data, or retrieval changes, not just after code releases.

AI agents, tool use, and runtime control boundaries

AI agents extend the risk surface because they can choose actions, call tools, and chain steps across systems. The security issue is not only what the model says, but whether it can trigger external side effects through APIs, workflows, or delegated access. When agent behaviour is combined with broad permissions, a single prompt can become an access or action pathway. That is where red teaming moves from content testing into control testing, because the real failure is often unbounded authority rather than unsafe language alone.

Practical implication: red team the full agent tool chain, including delegated permissions, approval gates, and downstream side effects.

NHI Mgmt Group analysis

AI red teaming is becoming a control test for governance assumptions, not just a security exercise. The article shows that modern AI systems fail in interaction, not only in code. That means governance has to account for prompts, retrieved data, tool use, and behaviour drift as part of the control surface. For practitioners, the key shift is from testing whether a model is safe in theory to testing whether its operating context stays governable in practice.

Agentic AI changes the boundary of what identity controls must protect. Once an AI system can select tools and trigger actions, the relevant risk is no longer limited to model output quality. Access scope, delegated authority, and approval boundaries become part of the red team target set. For identity programmes, this connects AI risk directly to NHI governance and makes runtime authorisation a first-class concern.

Prompt testing alone is not enough when the system can act on what it learns. AI red teaming must include downstream effects such as API calls, workflow execution, and data movement. The article implicitly shows why content safety and action safety are different control problems. Practitioners should treat any AI system with tool access as both a model and an identity-bearing actor.

Continuous testing is the only defensible posture when model behaviour changes over time. The source correctly frames red teaming as ongoing rather than one-and-done. That matters because models, prompts, policies, and connected data sources all shift after launch. For security leaders, the lesson is to align assurance with lifecycle change, not with release milestones.

AI red teaming belongs inside the broader identity lifecycle conversation. Red teaming reveals where access, authority, and governance drift beyond what was intended at approval time. That makes it relevant to IAM, PAM, and NHI lifecycle processes, especially where humans delegate decisions to systems that can keep operating without direct review. The practitioner implication is to govern AI behaviour as a lifecycle problem, not a point-in-time assessment.

From our research:
Only 1.5 out of 10 organisations are highly confident in their ability to secure NHIs, compared to nearly 1 in 4 for securing human identities, according to The State of Non-Human Identity Security.
Lack of credential rotation is cited as the top cause of NHI-related attacks by 45% of organisations, followed by inadequate monitoring and logging at 37%, according to The State of Non-Human Identity Security.
For teams expanding AI into production, the next step is to pair adversarial testing with lifecycle governance and workload identity controls, using the Ultimate Guide to NHIs , What are Non-Human Identities as the baseline reference.

What this signals

Runtime testing is becoming a governance requirement, not a specialist AI exercise. As AI systems move closer to business decisions, teams should expect red teaming to sit alongside access review, policy enforcement, and change control. The control question is no longer whether the model looks safe in a demo, but whether the production path still behaves safely after prompts, data, and tool access change.

The main programme risk is false assurance. AI security often looks complete when teams test a model in isolation, but the real exposure appears in the connected system, especially where an AI can call tools or move data across identities and workflows.

The governance standard will increasingly be continuous validation linked to lifecycle events. For practitioners, that means every new integration, permission change, or prompt update should trigger a fresh review of whether the system still operates within its intended boundary.

For practitioners

Build adversarial test cases for AI workflows Create prompt injection, jailbreak, and leakage scenarios for every AI path that accepts external text, retrieved content, or user uploads. Include cases that try to redirect the model into unsafe tool calls or policy bypasses.
Test delegated tool access, not just model output Map which APIs, workflows, and data stores an AI agent can reach, then red team the full execution chain for unintended side effects. If the system can act, the test must include what happens after the answer is generated.
Re-run security tests after behavioural changes Treat prompt updates, retrieval changes, fine-tuning, and new integrations as security events that require retesting. Behavioural drift should trigger fresh validation because a previously safe model can become unsafe without a code change.
Tie AI assurance to identity governance Assign clear ownership for AI systems that make or influence decisions, and align review cadence with access scope, delegated authority, and lifecycle changes. This is especially important where AI is connected to human approvals or non-human identities.

Key takeaways

AI red teaming exposes how models, agents, and connected workflows can fail in ways traditional application testing does not catch.
The operational risk is not limited to unsafe text generation, because delegated tool access can turn a model failure into an access or action failure.
Security teams should treat red teaming as a continuous governance process tied to lifecycle change, permission scope, and runtime behaviour.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Agentic AI Top 10 and OWASP Non-Human Identity Top 10 address the attack and risk surface, while NIST AI RMF set the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
OWASP Agentic AI Top 10		Covers prompt injection, jailbreaks, and agent tool misuse described in the article.
NIST AI RMF		AI governance and lifecycle oversight fit the article's continuous assurance message.
OWASP Non-Human Identity Top 10	NHI-03	Agent tool access and delegated permissions create NHI-style control risks.

Review delegated credentials and runtime access scopes before connecting AI systems to tools.

Key terms

AI Red Teaming: AI red teaming is the deliberate testing of a model, application, or agent with adversarial scenarios to expose failures before production use. It focuses on behaviour under pressure, including unsafe outputs, prompt manipulation, leakage, and tool misuse, rather than only scanning code or configuration.
Prompt Injection: Prompt injection is an attack that uses crafted input to override the system's intended instructions and influence model behaviour. In practice, it matters most when untrusted text, retrieved content, or user uploads can alter responses, tool calls, or policy decisions.
Behavioural Drift: Behavioural drift is the change in a model's responses, risk profile, or control boundary after deployment because prompts, data, integrations, or fine-tuning have changed. It is a governance problem because a system can become less predictable without any obvious code release.
AI Agent: An AI agent is a software entity that can choose actions, tools, and execution timing at runtime. When that independence is present, the agent must be governed as an identity-bearing actor, because its authority can create security impact beyond the model output itself.

What's in the full article

TROJ.AI's full blog post covers the operational detail this post intentionally leaves for the source:

Specific examples of prompt injection, jailbreak, and leakage test scenarios that practitioners can adapt to their own AI stack
The article's step-by-step breakdown of how red team findings are prioritised and turned into remediation work
Guidance on when to retrain a model versus when to apply downstream guardrails and access controls
The vendor's explanation of how continuous AI red teaming fits into the development lifecycle

👉 TROJ.AI's full post covers attack types, testing methods, and remediation patterns in more operational detail.

Deepen your knowledge

NHI governance, agentic AI identity, and machine identity security are core topics in our NHI Foundation Level course, the industry's only accredited NHI security programme. If you are responsible for identity security strategy or governance maturity in your organisation, it is worth exploring.

NHIMG Editorial Note
Published by the NHIMG editorial team on 2025-07-17.
NHI Mgmt Group — the independent authority on Non-Human Identity, IAM, and Agentic AI security. nhimg.org