TL;DR: Claude Sonnet 4 performs better than recent alternatives against real-world jailbreaks, prompt injection, and hidden-context attacks, according to Lakera research, while still remaining vulnerable to advanced multi-turn adversarial techniques and indirect prompt manipulation. The finding is simple: model quality alone does not close enterprise GenAI risk, and layered guardrails remain mandatory.
NHIMG editorial — based on content published by Lakera: Claude 4 Sonnet and enterprise LLM security under adversarial pressure
Questions worth separating out
Q: How should security teams test enterprise LLMs for prompt injection risk?
A: Test the model inside the real application path, not in isolation.
Q: When should organisations treat a model update as a security change?
A: Whenever the new model will touch sensitive data, privileged workflows, or external tools.
Q: What do security teams get wrong about built-in model safeguards?
A: They often assume built-in safeguards replace external controls.
Practitioner guidance
- Validate model behaviour under live adversarial scenarios Test prompt injection, indirect prompt injection, hidden-context leakage, and multi-turn jailbreak paths in the same retrieval and tool stack you will use in production.
- Re-assess model upgrades as security changes Run the same red-team suite after every model refresh, because security regressions can appear even when benchmark performance improves.
- Scope retrieval and context inputs tightly Limit what documents, memory, and prior turns can influence the model, and strip untrusted content before it reaches policy-sensitive prompts.
What's in the full article
Lakera's full article covers the operational detail this post intentionally leaves for the source:
- The benchmark categories used to compare Claude Sonnet 4, LLaMA 4 Maverick, and GPT 4.1 under adversarial pressure.
- Examples of the prompt-injection, multi-turn, and hidden-context test patterns used in Lakera's evaluation.
- The model-by-model behaviour differences across content injection, hidden instruction extraction, and indirect attack scenarios.
- The constitutional classifier behaviour example that the source uses to probe model refusal logic in practice.
👉 Read Lakera's analysis of Claude 4 Sonnet and enterprise LLM security →
Claude 4 Sonnet versus adversarial prompts: are controls keeping up?
Explore further
LLM security is becoming an access-control problem, not just a model-safety problem. Once a model can be manipulated through prompt injection or retrieved content, the security question shifts from output quality to authority boundaries. That is why GenAI governance has to sit alongside IAM, not outside it. Practitioners should treat model behaviour as part of the access plane, not just the application layer.
A few things that frame the scale:
- 96% of technology professionals identify AI agents as a growing security threat, and 66% believe this risk is immediate, according to AI Agents: The New Attack Surface report.
- Only 52% of companies can track and audit the data their AI agents access, leaving 48% with a complete blind spot for compliance and breach investigation.
A question worth separating out:
Q: How can teams keep GenAI systems usable without overblocking safe requests?
A: Set guardrails to target malicious patterns rather than broad content classes. Overly aggressive filters can block legitimate work, so teams should tune policies against real business prompts, measure false positives, and separate safety enforcement from user experience where possible.
👉 Read our full editorial: Claude 4 Sonnet and enterprise LLM security under adversarial pressure