Notifications

Clear all

Backbone LLM security: what the b3 benchmark changes for teams

Last Post

RSS

NHI Mgmt Group

(@nhi-mgmt-group)

Member Moderator

Joined: 1 year ago

Posts: 12387

Topic starter 05/07/2026 6:52 pm

TL;DR: AI agent security is being tested at the backbone LLM level using nearly 200,000 human red-team attempts and ten threat snapshots to measure how models behave under prompt injection, tool misuse, and data exfiltration pressure, according to Lakera. The key shift is that security must be measured at the decision point, not inferred from safety labels or end-to-end agent complexity.

NHIMG editorial — based on content published by Lakera: The Backbone Breaker Benchmark, testing the real security of AI agents

Questions worth separating out

Q: How should security teams test whether an AI agent is actually secure?

A: Test the backbone model under adversarial conditions, not just the full application stack.

Q: Why do safety filters not guarantee AI agent security?

A: Safety filters mainly constrain harmful content generation, while security concerns whether the model can be manipulated into taking unintended actions.

Q: What do security teams get wrong about AI agent benchmarks?

A: They often measure end-to-end complexity or general model quality instead of the exact failure moment.

Practitioner guidance

Test backbone resistance before agent rollout Measure how the core model responds to prompt injection, malicious tool requests, and poisoned context before allowing it into workflows that can reach data or execute actions.
Separate safety review from action-authorisation review Treat refusal behaviour and harmful-action resistance as different controls, because a model that blocks unsafe text may still comply with an attacker’s hidden instructions.
Replay adversarial scenarios under consistent conditions Use repeatable threat snapshots or equivalent test harnesses so you can compare how agents behave across models, releases, and tool configurations.

What's in the full report

Lakera's full research covers the operational detail this post intentionally leaves for the source:

The full benchmark design for threat snapshots, including how the state, attack vector, and scoring function are defined.
The 31-model evaluation breakdown, showing where different backbone models failed under specific adversarial conditions.
The ten representative threat scenarios used in Gandalf: Agent Breaker, including phishing link insertion, memory poisoning, and malicious code injection.
The comparison between baseline, hardened, and self-judging defenses across repeatable attack replay.

👉 Read Lakera's analysis of the Backbone Breaker Benchmark for AI agent security →

Backbone LLM security: what the b3 benchmark changes for teams?

Explore further

View Full Forum → | NHI Foundation Course →

Quote

Topic Tags

Mr NHI

(@mr-nhi)

Member Moderator

Joined: 3 months ago

Posts: 11961

05/07/2026 7:12 pm

Backbone-first security is now the right unit of analysis for AI agents. The article shows that end-to-end agent simulations hide the exact moment security fails, while backbone testing isolates the decision layer that turns text into action. That matters because AI agent governance cannot be reduced to application controls or safety filters. Practitioners should treat the model’s action boundary as the real security boundary.

A few things that frame the scale:

Only 44% of developers are reported to follow security best practices for secrets management, exposing a significant developer behaviour gap, according to The State of Secrets in AppSec.
The average estimated time to remediate a leaked secret is 27 days, despite 75% of organisations expressing strong confidence in their secrets management capabilities, according to The State of Secrets in AppSec.

A question worth separating out:

Q: How can organisations decide which AI agent controls matter most?

A: Prioritise controls that limit what the model can be induced to do with tools and data. If the agent can browse, call APIs, or execute code, those capabilities should be tested under attack first, because that is where prompt injection becomes an operational incident rather than a theoretical weakness.

👉 Read our full editorial: Backbone-first AI agent security exposes the limits of safety tests

ReplyQuote

Forum Statistics

11 Forums

13.6 K Topics

26.1 K Posts

38 Online

135 Members

Latest Post: LLM security and AI-driven crime: what security teams must change Our newest member: Alex Recent Posts Unread Posts Tags

Forum Icons: Forum contains no unread posts Forum contains unread posts

Topic Icons: Not Replied Replied Active Hot Sticky Unapproved Solved Private Closed

#1 Authority in NHI Education, Research and Advisory, empowering organizations to tackle the critical risks posed by Non-Human Identities (NHIs), including AI Agents.

Get in Touch

Quick Links

FAQ

NHI 101 Articles

Legal & Policies