AI prompt injection defenses still fail when models reveal secrets

By NHI Mgmt Group Editorial TeamPublished 2026-04-20Domain: Agentic AI & NHIsSource: Lakera

TL;DR: LLM defenses evolve from no checks to layered input and output filtering, yet still remain vulnerable to prompt injection, indirect elicitation, and transcript-based leakage across seven difficulty levels, according to Lakera’s Gandalf walkthrough. The deeper lesson is that blocking keywords is not identity governance, and runtime control must account for what the model can be induced to do.

At a glance

What this is: This is an analysis of Lakera’s Gandalf challenge and what it reveals about prompt injection, model leakage, and layered LLM defenses.

Why it matters: It matters because IAM, NHI, and autonomous AI programmes now have to govern not just access, but model behaviour, disclosure boundaries, and runtime control paths.

By the numbers:

Our challenge was more popular than expected since releasing it about 20 days ago, Gandalf registered close to 9M interactions from over 200k unique users.
At peak times, Gandalf has processed over 50 prompts every second.

👉 Read Lakera's walkthrough of Gandalf and the prompt injection challenge

Context

Prompt injection is not just a model safety issue. It is a control problem, because the model can be induced to disclose secrets or ignore intended boundaries even when the surrounding application believes the policy is simple and explicit.

For identity teams, that means LLM systems need governance around what the model may reveal, how inputs are screened, and how outputs are checked. The article uses Gandalf to show why keyword blocking alone does not create reliable access control for AI systems.

The primary keyword, prompt injection, is central here because the challenge demonstrates how easily a model can be pushed into revealing information when controls focus on text patterns instead of runtime behaviour.

Key questions

Q: How should security teams reduce prompt injection risk in LLM applications?

A: Security teams should combine input screening, output filtering, transcript-level inspection, and strict secret placement rules. The important shift is to govern the model as a disclosure surface, not just as a chatbot. That means testing indirect elicitation, multi-turn leakage, and semantic paraphrase, then enforcing controls around what the model can access, repeat, or reconstruct.

Q: Why do keyword filters fail against prompt injection attacks?

A: Keyword filters fail because attackers rarely need the exact forbidden term to recover sensitive information. They can ask in another language, split the secret into pieces, use roleplay, or prompt the model to explain or encode the answer. Effective control has to understand the intent and the conversation, not just match individual words.

Q: What breaks when an LLM is treated as a trusted policy enforcement point?

A: What breaks is the assumption that the model will consistently refuse unsafe disclosure just because it was instructed to do so. Prompting can shift the model’s response enough to leak secrets, and that makes it an unreliable sole enforcement point. Teams need external controls that verify both inputs and outputs before treating the result as safe.

Q: How can organisations tell whether their LLM controls are actually working?

A: They should measure whether the system stops indirect, multi-turn, and semantically equivalent requests, not just exact forbidden phrases. A control is only effective if it blocks disclosure across paraphrase, translation, and transcript reconstruction. If the model still leaks secret context through creative prompting, the governance boundary is not holding.

Technical breakdown

Why system prompts do not function like durable access controls

A system prompt sets initial behaviour, but it is not the same as an enforceable security boundary. In Gandalf, the model is told not to reveal the password, yet that instruction can still be destabilised by user input, indirect requests, or adversarial phrasing. That is the core weakness of relying on instructions alone: they shape behaviour, but they do not reliably constrain disclosure once the conversation starts. In practice, the model and its transcript become the real enforcement surface, not the prompt text itself.

Practical implication: treat the system prompt as a policy hint, not as the only control protecting secrets or regulated data.

Input guards, output guards, and transcript inspection

Lakera’s walkthrough shows three broad control patterns: block the input, block the output, or inspect the full transcript with another model. Each step raises the bar, but none of them is automatically sufficient because attackers can rephrase, split a secret across turns, or use semantic indirection. Transcript inspection is stronger than keyword checks because it reasons over context, but it still depends on the quality of the classifier and the scope of what it recognises as leakage. This is why LLM governance needs layered detection, not a single denial rule.

Practical implication: apply layered prompt and response inspection, and test whether the classifier catches semantic leakage rather than only exact strings.

Prompt injection as a disclosure-chain problem

The challenge is not just about asking for a password. It is about persuading the model to participate in a disclosure chain that turns benign language into secret recovery. That can include translation, encoding, guessing games, partial disclosure, or asking for indirect description. This makes prompt injection fundamentally different from a simple blocked-command scenario. The model’s weakness lies in contextual compliance, where it can be led to treat the attacker’s framing as legitimate enough to produce the forbidden content.

Practical implication: test LLM applications against indirect elicitation, partial-response leakage, and multi-turn exfiltration patterns.

NHI Mgmt Group analysis

Prompt injection is a governance problem, not just a model safety problem. The article shows that the real failure is not whether the model can answer a bad question, but whether the surrounding control plane can preserve disclosure boundaries under adversarial language. That is the same structural issue IAM faces when policy is expressed as advice rather than enforceable control. Practitioners should treat LLM disclosure as a governed access surface, not a conversational feature.

Keyword-based denial is too brittle to serve as identity enforcement for AI systems. Gandalf’s later levels show that exact-match filters break under paraphrase, translation, partial disclosure, and semantic indirection. The named concept here is prompt leakage boundary drift: the point at which the system’s idea of prohibited content becomes narrower than the attacker’s route to the same secret. That drift means policy and detection must operate on intent and context, not only text fragments. Practitioners should assume that static filters will be bypassed.

LLM secrets handling now behaves like machine identity governance under adversarial pressure. Once a model can be prompted into revealing internal state, the distinction between “content” and “credential” collapses. This is relevant to NHI and workload identity teams because secrets embedded in prompts, logs, or retrieval context become recoverable through model behaviour rather than direct access. The implication is that access policy, secret placement, and output controls must be designed together.

Defenses that inspect only one turn miss the real exfiltration path. The article repeatedly demonstrates that attackers can spread disclosure across multiple messages, then reconstruct the secret from fragments. That is the same problem access review systems face when they assume risk is visible in one event. For AI governance, the lesson is that control effectiveness depends on conversation-level state, not isolated prompts.

Autonomous behaviour is not required for risk, but runtime decision-making still changes the attack surface. Gandalf is not autonomous in the strict identity sense, yet it still makes runtime outputs that can be steered by the user. That means enterprises should not wait for fully autonomous agents before governing model behaviour. The immediate issue is that AI systems already mediate access, interpret intent, and surface secrets in ways legacy IAM controls were never built to inspect.

From our research:
The average estimated time to remediate a leaked secret is 27 days, despite 75% of organisations expressing strong confidence in their secrets management capabilities, according to The State of Secrets in AppSec.
Only 44% of developers are reported to follow security best practices for secrets management, exposing a significant developer behaviour gap.
For a broader governance lens, the NHI Lifecycle Management Guide shows why discovery, rotation, and offboarding have to be managed as a lifecycle, not a one-time fix.

What this signals

Prompt injection work should now be read as a warning for broader AI governance. If a model can be pushed into revealing secrets through indirect language, then the enterprise has a disclosure-control problem that spans prompts, logs, retrieval layers, and downstream workflows.

Prompt leakage boundary drift: once controls are phrased as simple refusals, the attacker only needs a path around the refusal language. That means security teams should test the full boundary between policy intent and model behaviour, not just the obvious input filter.

With the average estimated time to remediate a leaked secret at 27 days according to The State of Secrets in AppSec, delayed response becomes part of the risk model. Teams should pair AI content controls with the lifecycle discipline described in the NHI Lifecycle Management Guide.

For practitioners

Classify model outputs as governed disclosure events Treat any response that could reveal secrets, internal instructions, or sensitive context as a control decision subject to logging, review, and policy enforcement.
Test for indirect prompt injection paths Red-team against translation, encoding, partial disclosure, roleplay, and multi-turn reconstruction rather than only obvious forbidden-word prompts.
Move secret material out of prompt-adjacent context Do not place passwords, API keys, or sensitive tokens where a model can be induced to paraphrase, echo, or infer them from conversation state.
Inspect the full conversation transcript Evaluate leakage across multiple turns, because a safe single response can still become an unsafe reconstructed secret when combined with earlier prompts.

Key takeaways

Gandalf shows that LLMs can be steered into disclosing secrets even when the system prompt says not to reveal them.
The practical failure mode is not one bad prompt, but the collapse of keyword-based boundaries under paraphrase, translation, and multi-turn reconstruction.
Security teams need transcript-aware controls and tighter secret placement rules if they want LLM governance to hold under real attacker behaviour.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Agentic AI Top 10 address the attack and risk surface, while NIST AI RMF and NIST CSF 2.0 set the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
OWASP Agentic AI Top 10		Prompt injection and disclosure control are core agentic AI threat patterns.
NIST AI RMF		AI RMF covers governance and measurement of model risks like secret leakage.
NIST CSF 2.0	PR.AC-4	Access control principles apply when models can reveal protected information.

Test LLM systems for prompt injection, tool misuse, and unsafe disclosure before production rollout.

Key terms

Prompt Injection: Prompt injection is an attack that manipulates an AI model through crafted input so it follows the attacker’s intent instead of the system’s intended instruction. In practice, it exploits the model’s sensitivity to context, making disclosure, tool use, or refusal behaviour part of the attack surface.
Transcript Inspection: Transcript inspection is the practice of evaluating the full conversation, not just a single prompt or response, for signs of policy violation or leakage. It matters because unsafe content can be reconstructed across turns even when each individual message looks harmless on its own.
Prompt Leakage Boundary Drift: Prompt leakage boundary drift is the gap that appears when a system’s idea of forbidden output becomes narrower than the attacker’s routes to the same secret. The model still seems governed, but the effective boundary has shifted through paraphrase, translation, encoding, or multi-turn reconstruction.

What's in the full article

Lakera's full article covers the level-by-level mechanics this post intentionally leaves at the governance layer:

Seven-stage walkthrough of how Gandalf changes its defences as prompt difficulty increases
Concrete prompt examples that bypass specific input and output guard patterns
How the transcript-checking approach works when a second model inspects the full conversation
The article's own explanation of which attack styles worked best at each level

👉 Lakera's full post shows how each Gandalf level behaves and where the defences still give way

Deepen your knowledge

NHI governance, agentic AI identity, and machine identity security are core topics in our NHI Foundation Level course, the industry's only accredited NHI security programme. If you are responsible for identity security strategy or lifecycle governance in your organisation, it is worth exploring.

NHIMG Editorial Note
Published by the NHIMG editorial team on 2026-04-20.
NHI Mgmt Group — the independent authority on Non-Human Identity, IAM, and Agentic AI security. nhimg.org