AI red teaming shows why deterministic security no longer fits

By NHI Mgmt Group Editorial TeamPublished 2026-04-15Domain: Agentic AI & NHIsSource: Lakera

TL;DR: AI systems shift behavior with context, prompts, model updates, and tool use, so static tests miss the failures that emerge only during interaction, according to Lakera. Security now has to govern behavior across the full AI application lifecycle, not just model outputs.

At a glance

What this is: This is an analysis of why AI red teaming has to move from point-in-time testing to continuous, application-specific stress testing as AI systems become non-deterministic and agentic.

Why it matters: It matters to IAM and security teams because the control problem shifts from protecting fixed code paths to governing runtime behavior, tool use, and permissioned actions across AI, NHI, and human-operated workflows.

👉 Read Lakera's analysis of AI red teaming for non-deterministic systems

Context

Non-deterministic AI systems do not behave like traditional software, which is why static assurance breaks down. Once prompts, context, model updates, and tool calls shape the outcome, the old assumption that the same input reliably produces the same result no longer holds, and the primary keyword here is AI red teaming.

For IAM and security programmes, the issue is not only model quality. It is whether the system can be trusted to stay within its intended boundary as it interacts with data, APIs, and permissions, especially when autonomous actions turn a bad prompt into a real-world action path.

Lakera frames this as a shift from checking for obvious jailbreaks to testing how a specific AI application behaves under pressure. That is a much broader governance problem than model safety alone, and it is atypical for teams still relying on one-off validation exercises.

Key questions

Q: How should security teams red team non-deterministic AI systems?

A: They should test AI systems continuously across design, pre-release, and post-deployment phases, because behaviour can drift after model updates or tool changes. The test cases need to include prompt variation, indirect injection, and realistic tool chains so the team measures what the system actually does, not only what it says.

Q: Why do static tests fail for AI red teaming?

A: Static tests fail because AI behaviour is shaped by context, phrasing, and evolving model states, so a one-time benchmark cannot capture emergent failures. A system that looks safe in a fixed suite may still be exploitable when attackers change the wording, the retrieval context, or the action path.

Q: What breaks when AI systems can call tools autonomously?

A: The boundary between bad output and bad action breaks down. Once the system can send emails, update records, or run code, a successful prompt attack can become an operational incident, so security teams must govern tool permissions, action limits, and approval paths as part of the AI control model.

Q: How do organisations know if AI red teaming is working?

A: They know it is working when testing finds new failure modes before production does, and when the same controls are revalidated after model updates or tool additions. A mature programme produces repeatable evidence that the system still behaves within the intended boundary as it changes.

Technical breakdown

Why non-deterministic AI breaks point-in-time assurance

Traditional software security assumes stable behaviour. If the same request produces different outcomes because the model weights changed, the prompt shifted, or the context window altered the decision, then the assurance target is no longer a fixed code path. Red teaming must therefore test for emergent behaviour, not just known bad inputs. In AI systems, the prompt acts like executable influence over the application, which makes semantic ambiguity part of the attack surface. Static benchmarks miss the interactions that only appear under realistic load, changing context, or chained tool use.

Practical implication: treat red teaming as continuous validation across releases, not a one-time control check.

Prompt injection and semantic manipulation in AI applications

Prompt injection works because the model interprets language by intent and context rather than rigid syntax alone. That means a malicious instruction can be hidden inside otherwise valid content and still steer the system. The risk is not limited to offensive text generation. In application contexts, the model may take the manipulated meaning and apply it to data retrieval, workflow steps, or tool calls. This is why simple keyword filters and web application firewalls are weak defenses here: they do not understand the intended behaviour the attacker is trying to induce.

Practical implication: test the full application path, including prompts, retrieved content, and tool outputs, not just the model response.

Agentic AI and the expansion of the attack surface

When AI systems can write to databases, send messages, or execute code, the security question changes from what the model says to what the system does. Agentic systems create compound risk because the model, system prompt, tool-calling logic, and final action chain all matter at once. A successful injection in a chatbot may be embarrassing, but in an autonomous agent it can become an operational breach. The underlying issue is that permissions and tool access give language real operational force, so red teaming has to map how an attacker could turn intent manipulation into unauthorized action.

Practical implication: model the full tool-calling and permission chain as part of the attack surface.

Threat narrative

Attacker objective: The attacker aims to convert language-level influence into an operational action that the AI system should never have taken.

Entry occurs through manipulated natural language, where an attacker plants or disguises malicious intent inside a prompt, retrieved document, or other model input.
Escalation happens when the model interprets that language as instruction and propagates it into tool selection, workflow execution, or data access.
Impact follows when the AI system performs the wrong action, exposing data, authorising an unsafe transaction, or executing code outside intended governance.

Moltbook AI agent keys breach — Moltbook breach exposed 1.5M AI agent keys.
AI LLM hijack breach — attackers used stolen AWS access keys to hijack Anthropic LLM models on Bedrock.

Read our 52 NHI Breaches Analysis report for a comprehensive view of breaches impacting Non-Human Identities including AI Agents.

NHI Mgmt Group analysis

AI red teaming is now a governance discipline, not a security side exercise. The article shows that point-in-time validation fails when model behaviour shifts with phrasing, context, and updates. That means the control objective is no longer simple detection of bad prompts, but ongoing assurance that the application still behaves within its intended boundaries. Practitioners should treat red teaming as a permanent operating model, not a release checkpoint.

The real control gap is semantic, not syntactic. Traditional filters and deterministic testing assume attacks look like malformed inputs. In AI applications, the attack often looks like normal language that carries abnormal intent, which means the security boundary sits inside the meaning layer of the application. The implication is that governance teams must evaluate how the system interprets instruction, not just whether it accepts text.

Agentic AI turns non-human identity governance into execution governance. Once an AI system can call tools, write data, or send messages, the issue is no longer whether it can answer safely but whether it can act safely. That moves the discussion from model safety into permission scope, tool trust, and action containment. Practitioners should map AI behaviour as an identity and access problem, not only as an AI quality problem.

Continuous adversarial evaluation is the only stable assurance model for drifting AI systems. The article is explicit that models can become vulnerable after a silent update or a new tool integration. That creates an identity governance lesson across AI, NHI, and human-managed environments: controls built for stable systems lose relevance when the actor changes under runtime conditions. The implication is that assurance must track behaviour over time, not just configuration at deployment.

Runtime trust debt: the accumulated gap between what the AI system was tested for and what it can now do. Every prompt change, model update, and tool addition increases that debt if evaluation does not keep pace. This is the field-level issue teams need to name because it explains why yesterday's clean test result no longer proves today's safety. Practitioners should assume behavioural drift until continuous testing proves otherwise.

From our research:
The average estimated time to remediate a leaked secret is 27 days, despite 75% of organisations expressing strong confidence in their secrets management capabilities, according to The State of Secrets in AppSec.
43% of security professionals are concerned about AI systems learning and reproducing sensitive information patterns from codebases, which shows that AI risk is already being treated as an identity and data governance issue.
For a deeper lifecycle view, read NHI Lifecycle Management Guide for how access, visibility, and offboarding controls need to keep pace with changing identity behaviour.

What this signals

Runtime trust debt: AI programmes accumulate assurance gaps every time the model changes, a new tool is connected, or the prompt set is revised without a matching retest. The practical signal for teams is that AI governance now needs release-linked evaluation and drift monitoring, not annual validation cycles.

As organisations extend AI into workflows, the boundary between application security and identity governance gets thinner. That is why the NIST Cybersecurity Framework 2.0 remains relevant here, especially where continuous protection and response need to account for changing system behaviour.

The programme-level implication is simple: if an AI system can act, then security controls must verify action scope, not just content safety. Teams managing identity, permissions, and workflow automation should prepare for more frequent tool reviews and tighter approval paths as agentic use grows.

For practitioners

Shift red teaming to continuous evaluation Run adversarial tests during design, pre-release regression, and post-deployment drift monitoring so the control follows the system as it changes. Tie each test cycle to the current model, prompt set, and tool permissions.
Map the full AI application attack surface Document the foundation model, system prompt, retrieval sources, external APIs, and action endpoints together so testing covers the full path from input to impact. Reassess that map whenever a tool is added or permissions change.
Test for semantic bypass, not just bad strings Create scenarios where benign-looking language carries malicious instruction, then verify how the system handles the hidden intent across downstream actions. Include indirect prompt injection through content the model may trust.
Contain agentic permissions to the minimum action set Limit write, send, and execute privileges to the narrowest task scope possible, and separate high-impact tools from conversational interfaces. If the model can initiate action, human review should be mandatory before completion.

Key takeaways

AI red teaming has moved from a point-in-time exercise to a continuous control because model behaviour changes with context, updates, and tool access.
The central failure mode is semantic manipulation, where natural language can steer a system into unsafe actions without attacking the infrastructure directly.
Security teams need to govern AI applications as runtime actors, with scoped permissions, repeated adversarial testing, and drift-aware assurance.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Agentic AI Top 10 address the attack and risk surface, while NIST AI RMF and NIST Zero Trust (SP 800-207) set the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
OWASP Agentic AI Top 10		Covers prompt injection and agent tool-use abuse in AI applications.
NIST AI RMF		Addresses continuous monitoring and governance for changing AI system behaviour.
NIST Zero Trust (SP 800-207)	PR.AC-4	Least-privilege access and continuous verification apply to AI tool permissions.

Red team prompts, tools, and action paths together, then retest after every model or workflow change.

Key terms

Non-deterministic system: A non-deterministic system does not produce reliably identical outcomes from identical inputs. In AI security, that means prompts, context, model updates, and tool connections can change the result, so assurance has to measure behaviour over time rather than assume fixed output patterns.
Prompt injection: Prompt injection is an attack that manipulates an AI system through language rather than infrastructure compromise. The malicious instruction may be direct or hidden inside normal content, and the risk is that the model follows the attacker’s intent while appearing to process legitimate input.
Agentic AI: Agentic AI is AI that can take actions, not only generate responses. When the system can call tools, write data, or trigger workflows, security must govern action scope, approvals, and monitoring because the model’s decisions can produce direct operational impact.
Runtime trust debt: Runtime trust debt is the growing gap between what an AI system was last validated for and what it can now do after changes to prompts, models, or tools. It is a useful governance concept because it captures why yesterday’s safe result does not guarantee today’s safe behaviour.

What's in the full article

Lakera's full article covers the operational detail this post intentionally leaves for the source:

The article expands on the design, regression, and drift-monitoring phases of continuous AI red teaming.
It explains how Lakera frames prompt attacks as application-specific rather than generic jailbreak tests.
It shows why tool-calling logic and system prompts must be tested together with model outputs.
It closes with the practical case for automated adversarial evaluation across the AI lifecycle.

👉 Lakera's full article covers continuous evaluation, prompt attack paths, and agentic AI risk in more detail.

Deepen your knowledge

NHI governance, agentic AI identity, and machine identity security are core topics in our NHI Foundation Level course, the industry's only accredited NHI security programme. If you are responsible for identity security strategy or NHI governance in your organisation, it is worth exploring.

NHIMG Editorial Note
Published by the NHIMG editorial team on 2026-04-15.
NHI Mgmt Group — the independent authority on Non-Human Identity, IAM, and Agentic AI security. nhimg.org