Backbone-first AI agent security exposes the limits of safety tests

By NHI Mgmt Group Editorial TeamPublished 2025-10-30Domain: Agentic AI & NHIsSource: Lakera

TL;DR: AI agent security is being tested at the backbone LLM level using nearly 200,000 human red-team attempts and ten threat snapshots to measure how models behave under prompt injection, tool misuse, and data exfiltration pressure, according to Lakera. The key shift is that security must be measured at the decision point, not inferred from safety labels or end-to-end agent complexity.

At a glance

What this is: This is a research post on a new benchmark for testing AI agent security at the backbone LLM level, with the central finding that security failures are best measured at the model’s decision point.

Why it matters: It matters because IAM, NHI, and AI governance teams need a way to assess what an agent can be made to do, not just what it can say, when tool use and delegated access are in play.

👉 Read Lakera's analysis of the Backbone Breaker Benchmark for AI agent security

Context

AI agent security fails when teams confuse model safety with model exploitability. The article argues that a backbone LLM can be safe in the content-moderation sense and still be manipulated into harmful actions when prompts, files, or web inputs are weaponised. For identity programmes, that means the meaningful boundary is not the chatbot interface but the decision point where the agent interprets instructions and acts.

That distinction matters across NHI, autonomous systems, and IAM governance. Once an agent can browse, call APIs, execute code, and choose actions from context, security has to focus on what the identity is allowed to do under pressure, not just whether the surrounding application is hardened. For readers building controls, the question becomes whether current governance can measure behavioural resistance in the same way it measures access.

Key questions

Q: How should security teams test whether an AI agent is actually secure?

A: Test the backbone model under adversarial conditions, not just the full application stack. Reproduce prompt injection, malicious tool requests, poisoned context, and data-exfiltration attempts with repeatable scenarios so you can see whether the agent resists harmful actions at the decision point. A useful test tells you what the model can be made to do, not only what it refuses to say.

Q: Why do safety filters not guarantee AI agent security?

A: Safety filters mainly constrain harmful content generation, while security concerns whether the model can be manipulated into taking unintended actions. An agent can refuse unsafe wording and still follow hidden instructions embedded in prompts, files, or web content. That is why teams need action-resistance testing in addition to content-safety checks.

Q: What do security teams get wrong about AI agent benchmarks?

A: They often measure end-to-end complexity or general model quality instead of the exact failure moment. The result is a score that looks useful but does not isolate whether the backbone, tool access, or orchestration caused the problem. Benchmarks need a specific state, attack vector, and scoring function to be operationally meaningful.

Q: How can organisations decide which AI agent controls matter most?

A: Prioritise controls that limit what the model can be induced to do with tools and data. If the agent can browse, call APIs, or execute code, those capabilities should be tested under attack first, because that is where prompt injection becomes an operational incident rather than a theoretical weakness.

Technical breakdown

Backbone LLM security versus full-agent testing

A backbone LLM is the core model that generates outputs, reasons through steps, and may decide whether to invoke tools or end a flow. Full-agent testing tries to evaluate every surrounding component at once, which creates noise from orchestration, memory, APIs, and interface logic. Backbone-first testing isolates the specific moments where the model itself fails, making security behaviour measurable rather than anecdotal. The value is not that the agent becomes simpler, but that the failure surface becomes attributable to one decision layer instead of a whole stack of interacting controls.

Practical implication: separate model-resistance testing from application hardening so you can see whether failures come from the backbone, the toolchain, or both.

Threat snapshots for prompt injection and tool misuse

Threat snapshots are micro-tests that capture one attack moment, one state, one objective, and one scoring function. Instead of asking whether an AI agent is generally secure, the benchmark asks how it reacts when a specific prompt, file, or web input tries to trigger malicious behaviour. That approach covers indirect prompt injection, malicious tool calls, memory poisoning, and data exfiltration in a repeatable way. It also makes comparison possible across models and application types because each test is replayed under consistent conditions rather than judged from a single observed failure.

Practical implication: use scenario-based security tests that replay the same adversarial pattern repeatedly, not one-off red-team anecdotes.

Reasoning depth changes exploitability

The benchmark reports that models which reason step by step are harder to exploit than models that answer immediately. That does not make them safe by default, but it shows that security is shaped by decision structure, not only by model size or content filters. Bigger models were not automatically safer, and safe-content alignment did not prevent harmful action under indirect attack. The core insight is that instruction-following is the same capability that creates utility and exposure, so the real problem is whether the model can distinguish adversarial context from legitimate task input.

Practical implication: evaluate reasoning behaviour as a security control, not just a capability metric, before trusting an agent with sensitive actions.

Threat narrative

Attacker objective: The attacker aims to manipulate the agent into taking harmful actions that appear to be normal task execution.

Entry occurs when a malicious prompt, file, or web input reaches the backbone LLM inside an otherwise legitimate agent workflow.
Escalation happens when the model treats attacker-supplied text as instruction, triggering tool calls, data extraction, or policy bypass from within the session.
Impact follows when the agent performs unintended actions such as exfiltrating data, inserting phishing links, or executing malicious code through delegated tools.

Moltbook AI agent keys breach — Moltbook breach exposed 1.5M AI agent keys.
AI LLM hijack breach — attackers used stolen AWS access keys to hijack Anthropic LLM models on Bedrock.

Read our 52 NHI Breaches Analysis report for a comprehensive view of breaches impacting Non-Human Identities including AI Agents.

NHI Mgmt Group analysis

Backbone-first security is now the right unit of analysis for AI agents. The article shows that end-to-end agent simulations hide the exact moment security fails, while backbone testing isolates the decision layer that turns text into action. That matters because AI agent governance cannot be reduced to application controls or safety filters. Practitioners should treat the model’s action boundary as the real security boundary.

Safety and security are different governance problems, and conflating them breaks agent oversight. A model can refuse harmful content and still be induced to perform harmful acts through indirect attack paths. That distinction is critical for identity leaders because permissioning, tool access, and agent behaviour are separable control domains. The implication is that content moderation alone does not satisfy security assurance for delegated AI execution.

Threat snapshots create a useful named concept: decision-point resistance. This benchmark treats the instant of model choice as the measurable security event, not the downstream outcome. That framing is stronger than generic red-teaming because it links exploitability to a repeatable state, an objective, and a score. Practitioners should use decision-point resistance as a testable criterion when evaluating AI agent security claims.

Reasoning depth is becoming a security variable, not just a quality attribute. The finding that step-by-step reasoning reduces vulnerability by about 15% against injection-based attacks suggests that security emerges from how a model processes input, not only from what it has been trained to avoid. That does not eliminate risk, but it changes the design conversation. Teams should re-evaluate whether their AI governance model measures outputs only, or also the decision structure that produces them.

Traditional IAM assumptions do not fully describe autonomous tool use. Identity programmes usually assume access decisions can be reviewed after the fact and that controls sit around stable workflows. In agentic systems, the decision path can change inside the session, and the attack surface is the agent’s willingness to convert text into action. The implication is that identity governance for AI must account for runtime behaviour, not just entitlement state.

From our research:
Only 44% of developers are reported to follow security best practices for secrets management, exposing a significant developer behaviour gap, according to The State of Secrets in AppSec.
The average estimated time to remediate a leaked secret is 27 days, despite 75% of organisations expressing strong confidence in their secrets management capabilities, according to The State of Secrets in AppSec.
For teams building AI governance, Ultimate Guide to NHIs , Lifecycle Processes for Managing NHIs provides the lifecycle context needed to connect identity, rotation, and offboarding controls.

What this signals

Decision-point resistance is becoming the practical benchmark for AI governance because model safety does not tell you whether an agent can be induced to act. Teams that allow tool use need to watch how often their controls evaluate the agent’s action path, not just its output path.

The governance gap will widen if AI programmes continue to inherit IAM assumptions built for stable, reviewable access. For practitioners, the next step is to align agent testing, tool permissions, and lifecycle oversight with the way the model actually consumes context and executes actions.

With 27 days as the average time to remediate a leaked secret in our research on secrets management, the broader lesson is that identity control gaps persist long after detection, which is exactly why runtime evaluation matters for agents that can act in seconds.

For practitioners

Test backbone resistance before agent rollout Measure how the core model responds to prompt injection, malicious tool requests, and poisoned context before allowing it into workflows that can reach data or execute actions.
Separate safety review from action-authorisation review Treat refusal behaviour and harmful-action resistance as different controls, because a model that blocks unsafe text may still comply with an attacker’s hidden instructions.
Replay adversarial scenarios under consistent conditions Use repeatable threat snapshots or equivalent test harnesses so you can compare how agents behave across models, releases, and tool configurations.
Map which tools an agent can misuse under pressure Inventory the APIs, code execution paths, and data sources that become dangerous if the backbone model accepts attacker-controlled instructions.

Key takeaways

AI agent security should be measured at the model’s decision point, because that is where adversarial input becomes harmful action.
Safety filtering and action resistance are not the same control, so a secure-looking model can still be operationally exploitable.
Practitioners need repeatable adversarial testing that isolates the backbone LLM, because full-agent complexity can hide the real failure mode.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Agentic AI Top 10 and OWASP Non-Human Identity Top 10 address the attack and risk surface, while NIST AI RMF set the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
OWASP Agentic AI Top 10		Covers agent prompt injection and tool misuse tested by the benchmark.
NIST AI RMF		Addresses governance and measurement for AI behaviour under attack.
OWASP Non-Human Identity Top 10	NHI-01	AI agents act as non-human identities when they access tools and data.

Map agent test cases to OWASP-AGENTIC threats before allowing tool-enabled deployment.

Key terms

Backbone LLM: The backbone LLM is the core model that interprets input, generates output, and may decide whether to continue, stop, or call tools. In agentic systems, it is the security-critical decision layer because malicious context can be converted into action even when surrounding software appears well controlled.
Threat Snapshot: A threat snapshot is a focused test case that freezes one attack moment, one system state, and one objective so security can be measured consistently. It lets teams compare how a model behaves under the same adversarial pressure rather than relying on broad, hard-to-reproduce agent simulations.
Action Resistance: Action resistance is the ability of a model or agent to avoid carrying out harmful instructions when those instructions are embedded in otherwise legitimate input. It is different from content safety, because the key question is not what the model says but what it can be induced to do.

What's in the full report

Lakera's full research covers the operational detail this post intentionally leaves for the source:

The full benchmark design for threat snapshots, including how the state, attack vector, and scoring function are defined.
The 31-model evaluation breakdown, showing where different backbone models failed under specific adversarial conditions.
The ten representative threat scenarios used in Gandalf: Agent Breaker, including phishing link insertion, memory poisoning, and malicious code injection.
The comparison between baseline, hardened, and self-judging defenses across repeatable attack replay.

👉 Lakera's full research covers the threat snapshots, model comparisons, and attack replay method behind b3.

Deepen your knowledge

NHI governance, agentic AI identity, and machine identity security are core topics in our NHI Foundation Level course, the industry's only accredited NHI security programme. If you are responsible for identity security strategy or lifecycle governance in your organisation, it is worth exploring.

NHIMG Editorial Note
Published by the NHIMG editorial team on 2025-10-30.
NHI Mgmt Group — the independent authority on Non-Human Identity, IAM, and Agentic AI security. nhimg.org