TL;DR: Across the seven difficulty levels of Lakera’s Gandalf walkthrough, LLM defenses progress from no checks to layered input and output filtering, yet remain vulnerable to prompt injection, indirect elicitation, and transcript-based leakage. The deeper lesson is that blocking keywords is not identity governance, and runtime control must account for what the model can be induced to do.
NHIMG editorial — based on content published by Lakera: Who Is Gandalf? The AI Challenge That Tests Your Prompting Skills
By the numbers:
- At peak times, Gandalf has processed over 50 prompts every second.
Questions worth separating out
Q: How should security teams reduce prompt injection risk in LLM applications?
A: Security teams should combine input screening, output filtering, transcript-level inspection, and strict secret placement rules.
Q: Why do keyword filters fail against prompt injection attacks?
A: Keyword filters fail because attackers rarely need the exact forbidden term to recover sensitive information.
Q: What breaks when an LLM is treated as a trusted policy enforcement point?
A: What breaks is the assumption that the model will consistently refuse unsafe disclosure just because it was instructed to do so.
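The keyword-filter answer above can be illustrated with a minimal sketch. It assumes an output filter that blocks only an exact, case-insensitive match of the secret. The secret value, the function name `naive_output_filter`, and the sample responses are all hypothetical and are not taken from Lakera's article.

```python
import base64

# Hypothetical secret, used only for illustration.
SECRET = "TANGERINE"


def naive_output_filter(response: str) -> bool:
    """Return True if the response is allowed, i.e. the exact secret is absent."""
    return SECRET.lower() not in response.lower()


# Responses that disclose the secret without containing the exact string.
leaky_responses = {
    "spelled_out": "The word is T-A-N-G-E-R-I-N-E.",
    "reversed": "Backwards it reads " + SECRET[::-1] + ".",
    "base64": "Encoded: " + base64.b64encode(SECRET.encode()).decode(),
}

for name, text in leaky_responses.items():
    verdict = "allowed" if naive_output_filter(text) else "blocked"
    print(f"{name}: {verdict}")
```

Each sample passes the filter even though a reader can recover the secret from it, which is the gap the answer describes: the attacker never needs the model to emit the forbidden term itself.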
Practitioner guidance
- Classify model outputs as governed disclosure events. Treat any response that could reveal secrets, internal instructions, or sensitive context as a control decision subject to logging, review, and policy enforcement.
- Test for indirect prompt injection paths. Red-team against translation, encoding, partial disclosure, roleplay, and multi-turn reconstruction rather than only obvious forbidden-word prompts.
- Move secret material out of prompt-adjacent context. Do not place passwords, API keys, or sensitive tokens where a model can be induced to paraphrase, echo, or infer them from conversation state.
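As a rough sketch of the guidance above, the snippet below treats each model response as a logged disclosure decision, normalises text before matching, and checks the accumulated transcript rather than a single turn. Everything here is an assumption for illustration: the names `allow_response` and `secret_variants`, the logger name, and the choice of variants (plain, reversed, Base64). It is not the mechanism Lakera describes, which uses a second model to inspect the conversation, and simple string matching like this catches only contiguous fragments.

```python
import base64
import logging
import re
from typing import List, Set

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("disclosure_guard")  # hypothetical logger name


def _squash(text: str) -> str:
    """Lowercase the text and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def secret_variants(secret: str) -> Set[str]:
    """Forms of the secret to match against; deliberately not exhaustive."""
    b64 = base64.b64encode(secret.encode()).decode()
    return {_squash(secret), _squash(secret[::-1]), _squash(b64)}


def allow_response(secret: str, prior_outputs: List[str], candidate: str) -> bool:
    """Decide whether `candidate` may be released, given earlier model outputs.

    The secret is passed in by the caller and is assumed to live outside the
    prompt, so the model never holds it in conversation state.
    """
    # Squashing the whole transcript also covers the single-response case,
    # because the candidate is the transcript's final segment.
    transcript = _squash("".join(prior_outputs + [candidate]))
    leaked = any(variant in transcript for variant in secret_variants(secret))
    # Record the decision so it can be reviewed like any other access event.
    log.info("disclosure_decision allowed=%s turns=%d",
             not leaked, len(prior_outputs) + 1)
    return not leaked


if __name__ == "__main__":
    print(allow_response("TANGERINE", [], "Happy to help with something else."))
    print(allow_response("TANGERINE", ["The first part is tange"], "rine is the rest"))
```

The second call is blocked only because the two fragments sit next to each other once punctuation and spacing are removed. Hints, paraphrases, and fragments separated by other text would still pass, so a check of this kind is one layer among several.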
What's in the full article
Lakera's full article covers the level-by-level mechanics that this post, which stays at the governance layer, intentionally leaves out:
- Seven-stage walkthrough of how Gandalf changes its defences as prompt difficulty increases
- Concrete prompt examples that bypass specific input and output guard patterns
- How the transcript-checking approach works when a second model inspects the full conversation
- The article's own explanation of which attack styles worked best at each level
👉 Read Lakera's walkthrough of Gandalf and the prompt injection challenge →
Prompt injection defenses in LLMs: what keeps breaking?
Explore further
Prompt injection is a governance problem, not just a model safety problem. The article shows that the real test is not whether the model can answer a bad question, but whether the surrounding control plane can preserve disclosure boundaries under adversarial language. That is the same structural issue IAM faces when policy is expressed as advice rather than enforceable control. Practitioners should treat LLM disclosure as a governed access surface, not a conversational feature.
A few things that frame the scale:
- The average estimated time to remediate a leaked secret is 27 days, despite 75% of organisations expressing strong confidence in their secrets management capabilities, according to The State of Secrets in AppSec.
- Only 44% of developers are reported to follow security best practices for secrets management, exposing a significant developer behaviour gap.
A question worth separating out:
Q: How can organisations tell whether their LLM controls are actually working?
A: They should measure whether the system stops indirect, multi-turn, and semantically equivalent requests, not just exact forbidden phrases. A control is only effective if it blocks disclosure across paraphrase, translation, and transcript reconstruction. If the model still leaks secret context through creative prompting, the governance boundary is not holding.
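One way to make that measurement concrete is to report a block rate per attack category rather than a single pass/fail result. The sketch below assumes a guard exposed as a function that returns True when a response is allowed. The function `block_rate_by_category`, the sample guard, the secret, and the test cases are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Tuple


def block_rate_by_category(
    guard: Callable[[str], bool],
    cases: Iterable[Tuple[str, str]],
) -> Dict[str, float]:
    """Fraction of leaking responses the guard blocked, grouped by category.

    `guard(response)` returns True when the response is allowed through.
    Every response in `cases` is assumed to leak, so the ideal rate is 1.0.
    """
    blocked: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for category, response in cases:
        total[category] += 1
        if not guard(response):
            blocked[category] += 1
    return {category: blocked[category] / total[category] for category in total}


# Hypothetical secret and an exact-match guard, for illustration only.
SECRET = "TANGERINE"


def exact_match_guard(response: str) -> bool:
    return SECRET.lower() not in response.lower()


cases = [
    ("exact", "The password is TANGERINE."),
    ("paraphrase", "It is a small orange citrus fruit, nine letters, starting with T."),
    ("translation", "Le mot de passe est MANDARINE."),
    ("spelling", "T, A, N, G, E, R, I, N, E"),
]

rates = block_rate_by_category(exact_match_guard, cases)
for category, rate in rates.items():
    print(f"{category}: {rate:.0%} blocked")
```

A guard that scores well on the exact category and poorly on the others is matching strings rather than holding a disclosure boundary, which is the distinction the answer draws.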
👉 Read our full editorial: AI prompt injection defenses still fail when models reveal secrets