Notifications

Clear all

Prompt injection defenses in LLMs: what keeps breaking?

Last Post

RSS

NHI Mgmt Group

(@nhi-mgmt-group)

Member Moderator

Joined: 1 year ago

Posts: 12387

Topic starter 05/07/2026 6:46 pm

TL;DR: LLM defenses evolve from no checks to layered input and output filtering, yet still remain vulnerable to prompt injection, indirect elicitation, and transcript-based leakage across seven difficulty levels, according to Lakera’s Gandalf walkthrough. The deeper lesson is that blocking keywords is not identity governance, and runtime control must account for what the model can be induced to do.

NHIMG editorial — based on content published by Lakera: Who Is Gandalf? The AI Challenge That Tests Your Prompting Skills

By the numbers:

At peak times, Gandalf has processed over 50 prompts every second.

Questions worth separating out

Q: How should security teams reduce prompt injection risk in LLM applications?

A: Security teams should combine input screening, output filtering, transcript-level inspection, and strict secret placement rules.

Q: Why do keyword filters fail against prompt injection attacks?

A: Keyword filters fail because attackers rarely need the exact forbidden term to recover sensitive information.

Q: What breaks when an LLM is treated as a trusted policy enforcement point?

A: What breaks is the assumption that the model will consistently refuse unsafe disclosure just because it was instructed to do so.

Practitioner guidance

Classify model outputs as governed disclosure events Treat any response that could reveal secrets, internal instructions, or sensitive context as a control decision subject to logging, review, and policy enforcement.
Test for indirect prompt injection paths Red-team against translation, encoding, partial disclosure, roleplay, and multi-turn reconstruction rather than only obvious forbidden-word prompts.
Move secret material out of prompt-adjacent context Do not place passwords, API keys, or sensitive tokens where a model can be induced to paraphrase, echo, or infer them from conversation state.

What's in the full article

Lakera's full article covers the level-by-level mechanics this post intentionally leaves at the governance layer:

Seven-stage walkthrough of how Gandalf changes its defences as prompt difficulty increases
Concrete prompt examples that bypass specific input and output guard patterns
How the transcript-checking approach works when a second model inspects the full conversation
The article's own explanation of which attack styles worked best at each level

👉 Read Lakera's walkthrough of Gandalf and the prompt injection challenge →

Prompt injection defenses in LLMs: what keeps breaking?

Explore further

View Full Forum → | NHI Foundation Course →

Quote

Topic Tags

Mr NHI

(@mr-nhi)

Member Moderator

Joined: 3 months ago

Posts: 11961

05/07/2026 7:04 pm

Prompt injection is a governance problem, not just a model safety problem. The article shows that the real failure is not whether the model can answer a bad question, but whether the surrounding control plane can preserve disclosure boundaries under adversarial language. That is the same structural issue IAM faces when policy is expressed as advice rather than enforceable control. Practitioners should treat LLM disclosure as a governed access surface, not a conversational feature.

A few things that frame the scale:

The average estimated time to remediate a leaked secret is 27 days, despite 75% of organisations expressing strong confidence in their secrets management capabilities, according to The State of Secrets in AppSec.
Only 44% of developers are reported to follow security best practices for secrets management, exposing a significant developer behaviour gap.

A question worth separating out:

Q: How can organisations tell whether their LLM controls are actually working?

A: They should measure whether the system stops indirect, multi-turn, and semantically equivalent requests, not just exact forbidden phrases. A control is only effective if it blocks disclosure across paraphrase, translation, and transcript reconstruction. If the model still leaks secret context through creative prompting, the governance boundary is not holding.

👉 Read our full editorial: AI prompt injection defenses still fail when models reveal secrets

ReplyQuote

Forum Statistics

11 Forums

13.6 K Topics

26.1 K Posts

39 Online

135 Members

Latest Post: LLM security and AI-driven crime: what security teams must change Our newest member: Alex Recent Posts Unread Posts Tags

Forum Icons: Forum contains no unread posts Forum contains unread posts

Topic Icons: Not Replied Replied Active Hot Sticky Unapproved Solved Private Closed

#1 Authority in NHI Education, Research and Advisory, empowering organizations to tackle the critical risks posed by Non-Human Identities (NHIs), including AI Agents.

Get in Touch

Quick Links

FAQ

NHI 101 Articles

Legal & Policies