Agent Breaker testbeds: what they reveal about GenAI security


(@nhi-mgmt-group)
Topic starter  

TL;DR: Agent Breaker models real-world GenAI attack surfaces such as indirect prompt injection, tool poisoning, context leaks, and goal hijacking inside playable levels that score partial and full success, according to Lakera. The broader lesson is that GenAI security needs attack-aware testing, not just prompt hygiene, because the failure modes are architectural and measurable.

NHIMG editorial — based on content published by Lakera: Inside Agent Breaker, building a real-world GenAI security playground

Questions worth separating out

Q: How should security teams test GenAI agents for prompt injection risk?

A: Security teams should test GenAI agents with both direct prompts and indirect inputs such as documents, webpages, and tool metadata.

Q: Why do AI agents create identity governance problems that standard app controls miss?

A: AI agents can ingest content, choose actions, and invoke tools inside the same runtime, so the security question becomes who or what is authorised to influence those actions.

Q: What do teams get wrong about defending against indirect prompt injection?

A: Teams often treat indirect prompt injection as a content safety issue when it is really a trust-boundary issue.

Practitioner guidance

  • Separate ingestion trust from execution trust: Mark retrieved documents, webpages, and tool descriptions as untrusted until they pass a policy check that is independent from the model’s reasoning layer.
  • Scope agent tool access to the smallest viable action set: Inventory every tool an agent can call and remove broad permissions that are not required for the task.
  • Test for partial compromise, not just complete jailbreaks: Use red-team exercises and scoring criteria that detect partial leaks, partial tool misuse, and partial objective drift.
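
The first two points above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Lakera's design or any product's API: the names `Trust`, `Content`, `policy_check`, `ALLOWED_TOOLS`, and `authorise_tool_call` are hypothetical, and the phrase list stands in for a much richer policy engine. The point it tries to show is that the trust decision and the tool decision both happen outside the model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class Trust(Enum):
    UNTRUSTED = "untrusted"
    CHECKED = "checked"


@dataclass(frozen=True)
class Content:
    source: str  # e.g. "document", "webpage", "tool_description"
    text: str
    trust: Trust = Trust.UNTRUSTED  # everything ingested starts untrusted


# Hypothetical deny list; a real policy check would be far more thorough.
SUSPICIOUS_PHRASES = (
    "ignore previous instructions",
    "reveal your system prompt",
)


def policy_check(content: Content) -> Content:
    """Promote content to CHECKED only if no suspicious phrase is found.

    The check runs outside the model, so injected text cannot talk its
    way past it.
    """
    lowered = content.text.lower()
    if any(phrase in lowered for phrase in SUSPICIOUS_PHRASES):
        return content  # stays UNTRUSTED
    return Content(content.source, content.text, Trust.CHECKED)


# Hypothetical per-task allowlist: the smallest viable action set.
ALLOWED_TOOLS = {
    "summarise": frozenset({"read_document"}),
}


def authorise_tool_call(task: str, tool: str, context: Iterable[Content]) -> bool:
    """Allow a tool call only if the tool is in scope for the task and
    every piece of context in play has passed the policy check."""
    if tool not in ALLOWED_TOOLS.get(task, frozenset()):
        return False
    return all(item.trust is Trust.CHECKED for item in context)
```

With this shape, a poisoned webpage in the context blocks execution even when the requested tool is otherwise in scope, and an out-of-scope tool is refused even when all context is clean.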

What's in the full article

Lakera's full article covers the technical design detail this post intentionally leaves at a higher level:

  • How the ten mock agentic apps are structured and how each threat snapshot maps to a real production pattern
  • The scoring logic for partial success, including how BLEU, ROUGE, embedding similarity, and classifiers are combined
  • Why guardrails change from Level 1 to Level 5 and how the intent classifier, LLM judge, and final filter interact
  • Examples of the attack objectives used in the playground, such as tool extraction, prompt extraction, and toxicity injection

👉 Read Lakera's analysis of Agent Breaker and GenAI attack mechanics →

(@mr-nhi)

Agentic security testing is now an identity governance problem, not just an LLM safety exercise. The article shows that once a model can read external context and call tools, the meaningful control surface becomes delegated access. That moves the discussion from prompt quality to authority boundaries, which is exactly where identity teams operate. Practitioners should treat the agent runtime as an identity-bearing system with constrained permissions, observable actions, and explicit trust boundaries.

A few things that frame the scale:

  • The average estimated time to remediate a leaked secret is 27 days, despite 75% of organisations expressing strong confidence in their secrets management capabilities, according to LLMjacking: How Attackers Hijack AI Using Compromised NHIs.
  • 43% of security professionals are concerned about AI systems learning and reproducing sensitive information patterns from codebases, which shows the governance issue is already affecting design assumptions.

A question worth separating out:

Q: How can organisations measure whether GenAI guardrails are actually working?

A: Organisations should measure whether guardrails prevent partial leaks, partial tool misuse, and objective drift, not just whether they block obvious jailbreaks. A useful program looks for how far an attack progressed, what information escaped, and whether the agent still behaved inside its intended scope. If the system only tracks binary success, it will miss the most common real-world failure states.
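
As a rough illustration of graded rather than binary scoring, here is a small Python sketch. It is an assumption-laden stand-in, not the scoring logic described in Lakera's article: `AttackResult`, `leaked_fraction`, and `classify` are hypothetical names, and plain token overlap replaces the similarity metrics and classifiers a real testbed would combine.

```python
from dataclasses import dataclass


def leaked_fraction(secret: str, output: str) -> float:
    """Share of the secret's tokens that appear in the agent output.

    Crude token overlap, used here only as a placeholder for a real
    similarity metric.
    """
    secret_tokens = secret.lower().split()
    if not secret_tokens:
        return 0.0
    output_tokens = set(output.lower().split())
    hits = sum(1 for token in secret_tokens if token in output_tokens)
    return hits / len(secret_tokens)


@dataclass
class AttackResult:
    leaked: float                  # 0.0 (nothing escaped) to 1.0 (everything)
    unauthorised_tool_calls: int   # calls outside the agent's intended scope
    stayed_on_task: bool           # False if the objective drifted


def classify(result: AttackResult) -> str:
    """Grade an attack attempt as 'full', 'partial', or 'blocked'."""
    if result.leaked >= 1.0:
        return "full"
    if (
        result.leaked > 0.0
        or result.unauthorised_tool_calls > 0
        or not result.stayed_on_task
    ):
        return "partial"
    return "blocked"
```

A tracker that only recorded the "full" case would report the middle outcome as a pass, which is the blind spot the answer above describes.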

👉 Read our full editorial: Agent breaker shows how GenAI security testbeds model real attacks



   