How should security teams defend against both jailbreaks and prompt injection?

Treat them as separate attack classes. Use model-level hardening, refusal tuning, and adversarial testing for jailbreaks, then use application-level controls such as input isolation, provenance checks, and action gating for prompt injection. If one control is expected to solve both problems, the programme will leave a blind spot in the layer the attacker actually targets.

Why This Matters for Security Teams

Prompt injection and jailbreaks are related only at the surface. Jailbreaks target the model itself, trying to override safety behaviour through crafted prompts, while prompt injection targets the application around the model by smuggling instructions through retrieved content, tool outputs, or user-supplied data. Defenders who collapse them into one problem usually miss the actual trust boundary being attacked. That distinction is central in the OWASP Agentic AI Top 10, which treats model compromise and application compromise as different control problems.

For security teams, the operational risk is that a system can appear hardened because the model refuses unsafe requests, while the surrounding workflow still accepts poisoned context and unsafe tool calls. NHI Management Group research on the State of Non-Human Identity Security shows how often organisations underestimate adjacent control failures, with only 1.5 out of 10 highly confident in securing NHIs. That confidence gap matters here because agents, plugins, and service accounts are often the execution path after a successful injection.

In practice, many security teams encounter prompt injection only after a tool has already been called with attacker-controlled instructions, rather than through intentional testing of the application boundary.

How It Works in Practice

A workable defence starts by separating the controls by layer. Jailbreak resistance belongs at the model layer: refusal tuning, adversarial red-teaming, safety evaluations, and monitoring for bypass patterns that alter the model’s own behaviour. Prompt injection belongs at the application layer: isolate untrusted input, strip instruction-like content from retrieved material where appropriate, label provenance, and gate any action that could change state, exfiltrate data, or call external systems.

That means security teams should not rely on a single “safe prompt” wrapper. Instead, they should combine runtime checks with workflow controls. The most effective patterns are:

Keep untrusted text in a data channel, not an instruction channel.
Tag retrieved content, user uploads, and tool responses with provenance metadata.
Require explicit policy checks before the model can invoke a tool or emit sensitive output.
Use allowlisted actions and least-privilege service identities for every downstream call.
Test for cross-boundary abuse, including hidden instructions in HTML, PDFs, logs, and copied emails.

This lines up with the current guidance in the OWASP Agentic Applications Top 10 and the threat patterns covered by CISA cyber threat advisories, both of which emphasise that trustworthy outputs depend on trustworthy inputs and constrained actions.

For teams handling autonomous workflows, the practical lesson is that prompt injection becomes a privilege escalation path when the model can read, decide, and act without a separate control gate. These controls tend to break down when agents are allowed to chain tools across loosely governed data sources because the injected instruction can survive multiple transformations before execution.

Common Variations and Edge Cases

Tighter filtering often increases false positives and can degrade usefulness, so organisations have to balance user experience against the risk of hidden instructions being preserved in normal business content. That tradeoff is especially visible in retrieval-augmented systems, email assistants, and document-processing workflows, where the model must read mixed-trust material rather than clean prompts.

Best practice is evolving for multi-agent systems. Current guidance suggests that each agent should have its own scoped identity, policy envelope, and action boundary, because one compromised agent should not be able to steer the whole workflow. In some deployments, content classification is useful, but it is not a complete defence. An attacker can still embed an instruction in a benign-looking support ticket, code comment, or meeting transcript.

For jailbreaks, adaptive adversarial testing is still necessary because safety tuning does not eliminate all bypass techniques. For prompt injection, policy must be evaluated at runtime against context, provenance, and intended action. If either control is missing, the system becomes vulnerable in the layer the attacker actually targeted. That distinction is also why the DeepSeek breach is a useful warning: sensitive model-adjacent assets and exposed data can amplify the impact of a successful prompt-based attack.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Agentic AI Top 10 and CSA MAESTRO address the attack and risk surface, while NIST AI RMF set the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
OWASP Agentic AI Top 10	LLM01	Prompt injection and jailbreaks map directly to LLM abuse and instruction-hijacking risk.
CSA MAESTRO	TR-2	Covers runtime trust and policy enforcement for agentic workflows and tool use.
NIST AI RMF		AI RMF governance is relevant to managing model and workflow risks across the lifecycle.

Test prompts adversarially and isolate untrusted context before any model-generated instruction can trigger action.

How should security teams defend against both jailbreaks and prompt injection?

Why This Matters for Security Teams

How It Works in Practice

Common Variations and Edge Cases

Standards & Framework Alignment

Related resources from NHI Mgmt Group