TL;DR: Claude Sonnet 4 performs better than recent alternatives against real-world jailbreaks, prompt injection, and hidden-context attacks, according to Lakera research, while still remaining vulnerable to advanced multi-turn adversarial techniques and indirect prompt manipulation. The finding is simple: model quality alone does not close enterprise GenAI risk, and layered guardrails remain mandatory.
At a glance
What this is: This is an independent security analysis of Claude 4 Sonnet that finds stronger resistance to adversarial prompting than some peers, but not immunity.
Why it matters: It matters because enterprise IAM, NHI, and AI governance teams need to treat model selection, guardrails, and access control as connected controls rather than separate problems.
👉 Read Lakera's analysis of Claude 4 Sonnet and enterprise LLM security
Context
Enterprise LLM security is now a governance problem, not just a model-quality problem. Claude 4 Sonnet is presented here as a case study in how adversarial resistance, jailbreak handling, and prompt-injection robustness can vary even when benchmark performance looks strong.
For IAM and security teams, the practical issue is that a model can be useful and still be fragile under pressure. That means AI access decisions, retrieval boundaries, and guardrails need to be evaluated together, especially where models sit inside workflows that touch sensitive data or privileged tools.
Key questions
Q: How should security teams test enterprise LLMs for prompt injection risk?
A: Test the model inside the real application path, not in isolation. Include hidden instructions, malicious retrieved content, multi-turn escalation, and language variations, then measure whether the model reveals policy text, ignores constraints, or produces unsafe tool instructions. A model that passes simple prompts may still fail once context, memory, and retrieval are involved.
Q: When should organisations treat a model update as a security change?
A: Whenever the new model will touch sensitive data, privileged workflows, or external tools. A release that improves reasoning can still weaken adversarial resistance, so teams should re-run red-team tests, compare refusal behaviour, and review downstream access paths before adoption.
Q: What do security teams get wrong about built-in model safeguards?
A: They often assume built-in safeguards replace external controls. In practice, refusal training can reduce risky output, but it does not prevent unsafe retrieval, poisoned context, or downstream misuse. Safe deployment still depends on input filtering, access scoping, and monitoring around the model.
Q: How can teams keep GenAI systems usable without overblocking safe requests?
A: Set guardrails to target malicious patterns rather than broad content classes. Overly aggressive filters can block legitimate work, so teams should tune policies against real business prompts, measure false positives, and separate safety enforcement from user experience where possible.
Technical breakdown
Prompt injection resistance in enterprise LLMs
Prompt injection is any attempt to override a model’s intended behaviour through crafted instructions, hidden context, or malicious retrieved content. In enterprise settings, the key issue is not whether a model can answer questions, but whether it can be manipulated into revealing system prompts, ignoring policy, or following attacker-supplied instructions. Lakera’s testing suggests some models are more resilient than others, but no frontier model is fully resistant. The technical takeaway is that attack resistance depends on the interaction between model behaviour, context handling, and the surrounding control plane.
Practical implication: test prompt-injection paths in your actual application stack, not just in isolated model demos.
Why model regressions matter for GenAI governance
Security regressions are especially important because enterprise teams often assume newer model versions are safer by default. That assumption fails when a model improves on reasoning or usability but weakens against adversarial inputs, hidden instructions, or multi-turn manipulation. In practice, regression means the control baseline changes underneath the programme. A model that was acceptable in one release may no longer meet the same risk tolerance in the next release, even if product messaging suggests overall improvement.
Practical implication: treat model upgrades like security changes and re-run red-team validation before production rollout.
Constitutional classifiers and layered guardrails
Constitutional classifiers are a model-side defence that encourages safer outputs by steering responses toward policy-aligned behaviour. They can reduce universal jailbreak success, but they do not replace external controls such as content filtering, retrieval filtering, identity scoping, and output inspection. The architectural lesson is that model-level refusal behaviour is only one layer in a broader security stack. If the surrounding environment still permits unsafe context, the model can remain exploitable even when its refusals improve.
Practical implication: place guardrails around retrieval, tool access, and output use, not only inside the model.
Threat narrative
Attacker objective: The attacker aims to make the model disclose hidden instructions, ignore safeguards, or produce unsafe content that can be leveraged inside enterprise workflows.
- Entry occurs through crafted prompts, multi-turn manipulation, or malicious context embedded in retrieved content that reaches the model.
- Escalation happens when the model accepts the injected instruction, reveals hidden context, or deviates from the intended policy boundary.
- Impact is unsafe disclosure or unsafe action guidance inside production GenAI workflows, which can expose confidential data or weaken enterprise controls.
Breaches seen in the wild
- Moltbook AI agent keys breach — Moltbook breach exposed 1.5M AI agent keys.
- AI LLM hijack breach — attackers used stolen AWS access keys to hijack Anthropic LLM models on Bedrock.
Read our 52 NHI Breaches Analysis report for a comprehensive view of breaches impacting Non-Human Identities including AI Agents.
NHI Mgmt Group analysis
LLM security is becoming an access-control problem, not just a model-safety problem. Once a model can be manipulated through prompt injection or retrieved content, the security question shifts from output quality to authority boundaries. That is why GenAI governance has to sit alongside IAM, not outside it. Practitioners should treat model behaviour as part of the access plane, not just the application layer.
Security regressions in newer models prove that capability gains and trust gains are not the same thing. A newer release can be better at reasoning while being worse under adversarial pressure, which means security baselines cannot be inferred from benchmark performance alone. This is a programme design issue, not a tuning issue. Teams need validation criteria that test attack resistance before they let a new model into privileged workflows.
Prompt injection is the enterprise equivalent of untrusted input gaining policy influence. The model may be technically isolated, but the application around it is not if retrieved documents, user messages, or chain-of-thought-adjacent context can alter behaviour. That makes context sanitisation, retrieval scoping, and output controls part of the identity security stack. Practitioners should assume the attack surface includes every place the model can be steered.
Layered defence is now the only defensible posture for enterprise GenAI. Constitutional classifiers, red teaming, context filtering, and response governance each reduce risk in a different part of the chain, but none can stand alone. The maturity question is no longer whether a model is “safe enough”, but whether the surrounding control framework can absorb failure without exposing sensitive systems. Teams should design for bounded compromise, not perfect refusal.
Runtime prompt steering: The defining failure mode here is that model behaviour can be altered after deployment by adversarial context that was never visible at approval time. That makes static model review an incomplete control, because the real risk appears only when the model is operating inside live enterprise workflows. The implication is that governance must shift from one-time approval to continuous adversarial validation.
From our research:
- 96% of technology professionals identify AI agents as a growing security threat, and 66% believe this risk is immediate, according to AI Agents: The New Attack Surface report.
- Only 52% of companies can track and audit the data their AI agents access, leaving 48% with a complete blind spot for compliance and breach investigation.
- For the governance angle that sits beside this risk, see OWASP Agentic AI Top 10 for the control failures most likely to surface in production.
What this signals
Runtime context control is becoming the real enterprise differentiator. If a model can be steered by retrieved content, hidden instructions, or conversation drift, then the security programme has to govern inputs, not just outputs. That is why model approval alone will not satisfy identity, risk, or audit stakeholders when the model can reach sensitive systems.
The next maturity step for GenAI programmes is to treat red teaming as a recurring control, not a one-time assurance activity. Model upgrades, prompt-template changes, and retrieval-source changes all alter the attack surface. Teams that do not re-validate after those changes will discover their baseline only after the first incident.
Adversarial context is the new control boundary. Once that boundary is understood, IAM and security teams can separate benign user interaction from unsafe instruction influence, and they can apply tighter scoping around tools, memory, and retrieval paths.
For practitioners
- Validate model behaviour under live adversarial scenarios Test prompt injection, indirect prompt injection, hidden-context leakage, and multi-turn jailbreak paths in the same retrieval and tool stack you will use in production.
- Re-assess model upgrades as security changes Run the same red-team suite after every model refresh, because security regressions can appear even when benchmark performance improves.
- Scope retrieval and context inputs tightly Limit what documents, memory, and prior turns can influence the model, and strip untrusted content before it reaches policy-sensitive prompts.
- Add independent output controls Use filters, policy checks, and human review gates for sensitive actions so a model refusal improvement does not become a false sense of safety.
Key takeaways
- Enterprise LLM security depends on adversarial resilience, not benchmark scores alone, because newer models can regress under prompt injection and hidden-context attacks.
- Built-in refusals and constitutional classifiers reduce risk, but they do not replace independent controls around retrieval, output use, and tool access.
- Programmes that re-test every model update and scope context tightly will be better positioned to keep GenAI usable without turning it into an uncontrolled security channel.
Standards & Framework Alignment
This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.
OWASP Agentic AI Top 10 address the attack and risk surface, while NIST AI RMF and NIST CSF 2.0 set the governance and control requirements practitioners need to meet.
| Framework | Control / Reference | Relevance |
|---|---|---|
| OWASP Agentic AI Top 10 | Prompt injection and tool steering are core agentic AI threat patterns. | |
| NIST AI RMF | Model risk, governance, and validation are central to safe enterprise deployment. | |
| NIST CSF 2.0 | PR.DS-5 | Data protection and control of sensitive information align with context leakage risk. |
Map GenAI data flows and enforce controls that prevent sensitive content from reaching unsafe prompts.
Key terms
- Prompt Injection: Prompt injection is an attack that manipulates a model through crafted instructions hidden in user input, documents, or retrieved context. The goal is to override intended behaviour, leak protected information, or trigger unsafe actions. In enterprise systems, the risk grows when the model can act on external context without strict input boundaries.
- Constitutional Classifier: A constitutional classifier is a model-side defence that steers responses toward policy-aligned behaviour using a set of guiding principles. It can reduce harmful output and jailbreak success, but it does not eliminate risk from poisoned context, unsafe retrieval, or downstream misuse. It is one layer, not a complete control surface.
- Adversarial Validation: Adversarial validation is the practice of testing a model or system against realistic attack patterns before and after deployment. It checks whether hidden instructions, multi-turn pressure, and malicious context can change behaviour. For enterprise GenAI, it is more useful than synthetic benchmark confidence because it reflects live operational risk.
What's in the full article
Lakera's full article covers the operational detail this post intentionally leaves for the source:
- The benchmark categories used to compare Claude Sonnet 4, LLaMA 4 Maverick, and GPT 4.1 under adversarial pressure.
- Examples of the prompt-injection, multi-turn, and hidden-context test patterns used in Lakera's evaluation.
- The model-by-model behaviour differences across content injection, hidden instruction extraction, and indirect attack scenarios.
- The constitutional classifier behaviour example that the source uses to probe model refusal logic in practice.
Deepen your knowledge
NHI governance, agentic AI identity, and machine identity lifecycle are core topics in our NHI Foundation Level course, the industry's only accredited NHI security programme. If you are responsible for identity security strategy or NHI governance in your organisation, it is worth exploring.
Published by the NHIMG editorial team on 2025-08-26.
NHI Mgmt Group — the independent authority on Non-Human Identity, IAM, and Agentic AI security. nhimg.org