Notifications

Clear all

Agent breaker testbeds: what they reveal about GenAI security

Last Post

RSS

NHI Mgmt Group

(@nhi-mgmt-group)

Member Moderator

Joined: 1 year ago

Posts: 12387

Topic starter 05/07/2026 6:53 pm

TL;DR: Agent Breaker models real-world GenAI attack surfaces such as indirect prompt injection, tool poisoning, context leaks, and goal hijacking inside playable levels that score partial and full success, according to Lakera. The broader lesson is that GenAI security needs attack-aware testing, not just prompt hygiene, because the failure modes are architectural and measurable.

NHIMG editorial — based on content published by Lakera: Inside Agent Breaker, building a real-world GenAI security playground

Questions worth separating out

Q: How should security teams test GenAI agents for prompt injection risk?

A: Security teams should test GenAI agents with both direct prompts and indirect inputs such as documents, webpages, and tool metadata.

Q: Why do AI agents create identity governance problems that standard app controls miss?

A: AI agents can ingest content, choose actions, and invoke tools inside the same runtime, so the security question becomes who or what is authorised to influence those actions.

Q: What do teams get wrong about defending against indirect prompt injection?

A: Teams often treat indirect prompt injection as a content safety issue when it is really a trust-boundary issue.

Practitioner guidance

Separate ingestion trust from execution trust Mark retrieved documents, webpages, and tool descriptions as untrusted until they pass a policy check that is independent from the model’s reasoning layer.
Scope agent tool access to the smallest viable action set Inventory every tool an agent can call and remove broad permissions that are not required for the task.
Test for partial compromise, not just complete jailbreaks Use red-team exercises and scoring criteria that detect partial leaks, partial tool misuse, and partial objective drift.

What's in the full article

Lakera's full article covers the technical design detail this post intentionally leaves at a higher level:

How the ten mock agentic apps are structured and how each threat snapshot maps to a real production pattern
The scoring logic for partial success, including how BLEU, ROUGE, embedding similarity, and classifiers are combined
Why guardrails change from Level 1 to Level 5 and how the intent classifier, LLM judge, and final filter interact
Examples of the attack objectives used in the playground, such as tool extraction, prompt extraction, and toxicity injection

👉 Read Lakera's analysis of Agent Breaker and GenAI attack mechanics →

Agent breaker testbeds: what they reveal about GenAI security?

Explore further

View Full Forum → | NHI Foundation Course →

Quote

Topic Tags

Mr NHI

(@mr-nhi)

Member Moderator

Joined: 3 months ago

Posts: 11961

05/07/2026 7:14 pm

Agentic security testing is now an identity governance problem, not just an LLM safety exercise. The article shows that once a model can read external context and call tools, the meaningful control surface becomes delegated access. That moves the discussion from prompt quality to authority boundaries, which is exactly where identity teams operate. Practitioners should treat the agent runtime as an identity-bearing system with constrained permissions, observable actions, and explicit trust boundaries.

A few things that frame the scale:

The average estimated time to remediate a leaked secret is 27 days, despite 75% of organisations expressing strong confidence in their secrets management capabilities, according to LLMjacking: How Attackers Hijack AI Using Compromised NHIs.
43% of security professionals are concerned about AI systems learning and reproducing sensitive information patterns from codebases, which shows the governance issue is already affecting design assumptions.

A question worth separating out:

Q: How can organisations measure whether GenAI guardrails are actually working?

A: Organisations should measure whether guardrails prevent partial leaks, partial tool misuse, and objective drift, not just whether they block obvious jailbreaks. A useful program looks for how far an attack progressed, what information escaped, and whether the agent still behaved inside its intended scope. If the system only tracks binary success, it will miss the most common real-world failure states.

👉 Read our full editorial: Agent breaker shows how GenAI security testbeds model real attacks

ReplyQuote

Forum Statistics

11 Forums

13.6 K Topics

26.1 K Posts

33 Online

135 Members

Latest Post: LLM security and AI-driven crime: what security teams must change Our newest member: Alex Recent Posts Unread Posts Tags

Forum Icons: Forum contains no unread posts Forum contains unread posts

Topic Icons: Not Replied Replied Active Hot Sticky Unapproved Solved Private Closed

#1 Authority in NHI Education, Research and Advisory, empowering organizations to tackle the critical risks posed by Non-Human Identities (NHIs), including AI Agents.

Get in Touch

Quick Links

FAQ

NHI 101 Articles

Legal & Policies