By NHI Mgmt Group Editorial TeamPublished 2025-09-08Domain: Agentic AI & NHIsSource: Lakera

TL;DR: Static LLM defenses fail as attackers adapt, and security must be measured against both attack resistance and user utility, according to Lakera’s Gandalf the Red analysis. The practical lesson is that LLM governance now needs adaptive controls, layered defenses, and narrower application scope rather than one fixed prompt-time safeguard.


At a glance

What this is: This research argues that LLM security is a moving target, and that defenses must adapt to attacker behaviour without breaking legitimate use.

Why it matters: It matters because IAM, NHI, and AI governance teams now have to balance protection, usability, and runtime decision-making across emerging AI-driven applications.

👉 Read Lakera's research on adaptive defenses for LLM security and utility


Context

LLM security is no longer just a question of blocking malicious prompts. The harder problem is preserving useful behaviour for legitimate users while attackers keep changing tactics in response to model feedback, which makes static controls increasingly brittle.

For IAM and security teams, the issue is not only prompt injection or content filtering. It is the governance gap created when an application’s security posture is fixed at design time but the threat response needs to change during live interaction, especially as AI systems become more operationally embedded.


Key questions

Q: How should security teams evaluate LLM defenses in production?

A: They should evaluate both attack resistance and user utility. A defense that blocks more malicious prompts but also degrades legitimate completion quality may be operationally unacceptable. Good evaluation uses realistic sessions, measures attacker adaptation over time, and compares security gains against the loss in usability before deciding what to deploy.

Q: Why do static prompt defenses fail against AI attackers?

A: Static defenses fail because attackers learn from each refusal and refine their prompts accordingly. Once the adversary can observe model behaviour, the attack becomes iterative rather than one-off. That means the control must change with the threat, or it will eventually be mapped and bypassed.

Q: When does narrowing an LLM’s scope help security?

A: Narrowing scope helps when the application only needs a limited set of tasks and data domains. Constraining the system prompt or allowed behaviour reduces attack surface, but only if the restriction still allows legitimate users to complete their work. If the scope becomes too broad or too rigid, the defence loses value.

Q: How can teams tell whether an LLM defence is too strict?

A: A defence is too strict when it starts rejecting benign requests, shortening useful answers, or preventing core tasks from being completed. Those are signs that the model’s utility has been reduced past the point where the security gain is worth the operational cost.


Technical breakdown

Dynamic attacker behaviour in LLM security

LLM attackers do not usually stop after a single blocked attempt. They adapt prompts, exploit feedback, and search for weaknesses in the model’s response patterns. That changes security from a one-time policy problem into a live adversarial loop. A static rule set may look effective in a red-team demo but degrade quickly once an attacker learns what the system rejects, what it reveals, and how much persistence is required to get through. The core mechanism is iterative probing, not a single exploit path.

Practical implication: test LLM controls against adaptive adversaries, not just single-shot prompt samples.

Security-utility trade-offs in model prompting

LLM defenses affect more than whether a request is blocked. They also shape response length, response quality, and the range of legitimate tasks the application can still perform. System prompts can narrow scope, but they can also make a model less useful if they are too restrictive or too vague. That is why security for LLMs needs an explicit utility metric, not just a deny-or-allow decision. The operational question is how much function you are willing to constrain to reduce exposure.

Practical implication: define acceptable usability loss before tightening LLM controls in production.

Why defense in depth matters for AI applications

A single control rarely covers all LLM attack paths. One layer may reduce prompt leakage, another may catch suspicious session patterns, and a third may limit what the application is allowed to do. When combined, these controls reduce the chance that an attacker can rely on one predictable weakness. This is especially relevant when the model is embedded in a workflow where content, tools, and user interactions all interact. The security gain comes from overlapping constraints, not from one perfect filter.

Practical implication: combine prompt controls, session monitoring, and scoped capabilities instead of relying on one safeguard.


Threat narrative

Attacker objective: The attacker aims to make the model reveal restricted information or behave outside its intended operating scope.

  1. Entry occurs when an attacker interacts with the LLM through normal user-facing prompts and probes the system for weaknesses in its policy boundaries.
  2. Escalation happens as the attacker adapts to feedback from the model, learning which prompts bypass the defense and which interactions trigger refusal.
  3. Impact is the successful extraction of protected information or the steering of model behaviour in ways that undermine the intended security boundary.

Read our 52 NHI Breaches Analysis report for a comprehensive view of breaches impacting Non-Human Identities including AI Agents.


NHI Mgmt Group analysis

Adaptive LLM security is now a governance problem, not just a prompt-engineering problem. Static controls assume the threat is stable enough to be measured once and enforced indefinitely. That assumption fails when attackers learn from model feedback and change tactics mid-session. The implication is that governance teams must treat model security as an ongoing control loop, not a one-time configuration.

Security-utility trade-offs have become the real control boundary for AI applications. The research makes clear that blocking more traffic is not automatically better if legitimate users lose response quality or task completion. That means success criteria must include utility, not only attack rejection. Practitioners should expect to justify where the acceptable balance sits for each use case.

Defense in depth remains the most credible pattern for LLM protection because no single layer covers adaptive abuse. Prompt scope restriction, session-based detection, and capability limits each address different failure modes. The field should stop treating content filters as a complete control set. Teams that build overlapping controls will have a materially better chance of containing adversarial iteration.

Gandalf-style interactive testing is valuable because it exposes how quickly real attackers can learn a model’s boundaries. Benchmarks that assume static adversaries understate the problem. The practical conclusion is that AI security testing must evaluate adaptation, not only initial block rates, if the result is meant to guide governance decisions.

The named concept here is the security-utility boundary. It is the point at which additional restriction starts to damage legitimate use faster than it reduces risk. That boundary is different for every AI application, which means practitioners need explicit acceptance criteria rather than generic model hardening goals.

From our research:

  • 80% of organisations report their AI agents have already performed actions beyond their intended scope, including accessing unauthorised systems (39%), inappropriately sharing sensitive data (31%), and revealing access credentials (23%), according to AI Agents: The New Attack Surface report.
  • Another finding from the same research shows that only 52% of companies can track and audit the data their AI agents access, leaving 48% with a compliance and investigation blind spot.
  • For a broader control lens, see OWASP Agentic AI Top 10 for the runtime risks that adaptive defences need to address.

What this signals

Security-utility boundary: teams will need to set explicit acceptance criteria for how much usability can be traded for stronger model defence. That shift will push AI governance closer to product risk management, especially where LLMs are already embedded in customer or employee workflows.

The practical signal is that red-teaming can no longer stop at first-pass jailbreak resistance. Use cases with model-driven actions should also be reviewed for downstream authority, because the same control gap that weakens prompt security can expand operational blast radius once the model is connected to tools.


For practitioners

  • Define the security-utility boundary for each AI use case Set a clear threshold for how much response quality, task completion, or latency you will accept before a defense is considered too restrictive for production.
  • Test controls against adaptive attacker sessions Run evaluations that let the same adversary adapt over multiple prompts, because single-turn red teaming will miss the learning loop that breaks static defenses.
  • Layer prompt scope, session controls, and capability limits Use more than one safeguard so that a weakness in one layer does not leave the model exposed to the same attack path.
  • Review autonomous workflows for hidden model authority If an LLM can influence downstream actions, narrow what it can access and validate which decisions still require human approval before execution.

Key takeaways

  • LLM security breaks down when controls assume attacker behaviour is static and the model’s response boundary will not be learned over time.
  • The evidence points to a real operational gap between protection and usability, which means governance needs measurable acceptance criteria, not slogans.
  • Adaptive testing and layered controls are the only credible way to keep AI systems useful while reducing the chance of iterative abuse.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Agentic AI Top 10 address the attack and risk surface, while NIST AI RMF and NIST CSF 2.0 set the governance and control requirements practitioners need to meet.

FrameworkControl / ReferenceRelevance
OWASP Agentic AI Top 10A1Adaptive attacker behaviour and tool abuse are core agentic AI risks in this article.
NIST AI RMFThe article centres on balancing risk controls against utility in AI systems.
NIST CSF 2.0PR.DS-1The post addresses access and output control for AI systems that may expose sensitive data.

Apply AI RMF governance to define acceptable risk, utility, and monitoring thresholds for LLM use.


Key terms

  • Adaptive Defenses: Security controls that change in response to attacker behaviour, rather than remaining fixed after deployment. In LLM environments, adaptive defenses use feedback, session history, or dynamic thresholds to respond to evolving prompts and reduce the chance that an attacker can learn a stable bypass.
  • Security-Utility Trade-off: The balance between preventing malicious behaviour and preserving enough legitimate function for users to complete their tasks. In AI systems, stricter controls can reduce risk but also shorten responses, block valid requests, or limit usefulness, so the acceptable balance must be set explicitly.
  • Dynamic Security and Utility Threat Model: A threat model that evaluates AI security with both attacker adaptation and user experience in view. It treats model behaviour as dynamic, which is important because the same control can improve resistance to abuse while simultaneously degrading legitimate performance or operational value.
  • Defense in Depth: A control strategy that combines multiple overlapping safeguards so that one failure does not expose the whole system. For LLMs, this means pairing prompt restrictions, session monitoring, and capability limits rather than relying on a single content filter or block rule.

What's in the full report

Lakera's full research article covers the operational detail this post intentionally leaves for the source:

  • The D-SEC threat model and how it formalises attacker adaptation alongside utility loss.
  • The session-completion and attacker-failure metrics used to compare defence strategies.
  • The examples of prompt restriction and defence layering that show how different controls change usability.
  • The Gandalf gameplay findings that illustrate how iterative red teaming exposes weaknesses static tests miss.

👉 Lakera's full article covers D-SEC, Gandalf findings, and the utility trade-offs behind each defence choice.

Deepen your knowledge

NHI governance, agentic AI identity, and machine identity security are core topics in our NHI Foundation Level course, the industry's only accredited NHI security programme. If you are responsible for identity security strategy or identity governance in your organisation, it is worth exploring.
NHIMG Editorial Note
Published by the NHIMG editorial team on 2025-09-08.
NHI Mgmt Group — the independent authority on Non-Human Identity, IAM, and Agentic AI security. nhimg.org