TL;DR: AI agent security is being tested at the backbone LLM level using nearly 200,000 human red-team attempts and ten threat snapshots to measure how models behave under prompt injection, tool misuse, and data exfiltration pressure, according to Lakera. The key shift is that security must be measured at the decision point, not inferred from safety labels or end-to-end agent complexity.
NHIMG editorial — based on content published by Lakera: The Backbone Breaker Benchmark, testing the real security of AI agents
Questions worth separating out
Q: How should security teams test whether an AI agent is actually secure?
A: Test the backbone model under adversarial conditions, not just the full application stack.
Q: Why do safety filters not guarantee AI agent security?
A: Safety filters mainly constrain harmful content generation, while security concerns whether the model can be manipulated into taking unintended actions.
Q: What do security teams get wrong about AI agent benchmarks?
A: They often measure end-to-end complexity or general model quality instead of the exact failure moment.
Practitioner guidance
- Test backbone resistance before agent rollout Measure how the core model responds to prompt injection, malicious tool requests, and poisoned context before allowing it into workflows that can reach data or execute actions.
- Separate safety review from action-authorisation review Treat refusal behaviour and harmful-action resistance as different controls, because a model that blocks unsafe text may still comply with an attacker’s hidden instructions.
- Replay adversarial scenarios under consistent conditions Use repeatable threat snapshots or equivalent test harnesses so you can compare how agents behave across models, releases, and tool configurations.
What's in the full report
Lakera's full research covers the operational detail this post intentionally leaves for the source:
- The full benchmark design for threat snapshots, including how the state, attack vector, and scoring function are defined.
- The 31-model evaluation breakdown, showing where different backbone models failed under specific adversarial conditions.
- The ten representative threat scenarios used in Gandalf: Agent Breaker, including phishing link insertion, memory poisoning, and malicious code injection.
- The comparison between baseline, hardened, and self-judging defenses across repeatable attack replay.
👉 Read Lakera's analysis of the Backbone Breaker Benchmark for AI agent security →
Backbone LLM security: what the b3 benchmark changes for teams?
Explore further
Backbone-first security is now the right unit of analysis for AI agents. The article shows that end-to-end agent simulations hide the exact moment security fails, while backbone testing isolates the decision layer that turns text into action. That matters because AI agent governance cannot be reduced to application controls or safety filters. Practitioners should treat the model’s action boundary as the real security boundary.
A few things that frame the scale:
- Only 44% of developers are reported to follow security best practices for secrets management, exposing a significant developer behaviour gap, according to The State of Secrets in AppSec.
- The average estimated time to remediate a leaked secret is 27 days, despite 75% of organisations expressing strong confidence in their secrets management capabilities, according to The State of Secrets in AppSec.
A question worth separating out:
Q: How can organisations decide which AI agent controls matter most?
A: Prioritise controls that limit what the model can be induced to do with tools and data. If the agent can browse, call APIs, or execute code, those capabilities should be tested under attack first, because that is where prompt injection becomes an operational incident rather than a theoretical weakness.
👉 Read our full editorial: Backbone-first AI agent security exposes the limits of safety tests