Jailbreaks vs prompt injection: why AI defenses miss the point

By NHI Mgmt Group Editorial TeamPublished 2025-12-17Domain: Agentic AI & NHIsSource: Pillar Security

TL;DR: Security teams often conflate jailbreaks with prompt injection, but they attack different layers of an AI system: jailbreaks target model safety tuning, while prompt injection exploits how applications mix trusted instructions with untrusted content, according to Pillar Security. Treating them as synonyms creates blind spots because the right defense depends on whether the risk lives in the model, the application, or both.

At a glance

What this is: This analysis separates jailbreaks from prompt injection and shows that each targets a different layer of an AI system.

Why it matters: IAM and security teams need that distinction because the control set changes when access, instruction flow, and untrusted content interact across AI, NHI, and human governance.

👉 Read Pillar Security's analysis of why jailbreaks and prompt injection are not the same risk

Context

Prompt injection and jailbreaks are related but not interchangeable attack patterns. One targets the model’s refusal behaviour, while the other exploits the way applications combine trusted instructions with untrusted data.

For identity and access teams, the distinction matters because AI systems increasingly behave like NHI-enabled services that consume data, invoke tools, and expose sensitive actions through application trust boundaries. When the wrong threat model drives the control design, the programme ends up protecting the wrong layer.

Key questions

Q: How should security teams defend against both jailbreaks and prompt injection?

A: Treat them as separate attack classes. Use model-level hardening, refusal tuning, and adversarial testing for jailbreaks, then use application-level controls such as input isolation, provenance checks, and action gating for prompt injection. If one control is expected to solve both problems, the programme will leave a blind spot in the layer the attacker actually targets.

Q: Why do prompt injections remain dangerous even when the model seems well aligned?

A: Because alignment protects the model’s behaviour, not the application’s trust boundary. A well aligned model can still follow malicious instructions that arrive inside trusted-looking content if the application concatenates them into the same prompt stream. The risk is architectural, so alignment alone cannot prevent it.

Q: What do teams get wrong when they rely on prompt filters alone?

A: They often assume the filter can spot every harmful instruction by wording alone. That misses indirect prompt injection, where the payload looks normal in context and only becomes malicious because the application treats external content as instruction-bearing. Filters help, but they do not replace source separation and execution constraints.

Q: How do security teams decide which controls to prioritise for AI applications?

A: Start by asking whether the main exposure is model-level refusal bypass or application-level instruction hijacking. If attackers are trying to coerce the model into unsafe behaviour, prioritise alignment and jailbreak detection. If untrusted content can influence tools or data access, prioritise trust separation, provenance, and output validation.

Technical breakdown

Jailbreaks target model alignment, not application trust boundaries

A jailbreak tries to make the model ignore the refusal behaviour learned during safety tuning such as RLHF and instruction tuning. The attacker usually uses role-play, hypothetical framing, or instruction overrides to push the model into a different interpretation of the request. That is an attack on the model’s internal safety layer, not on the surrounding application architecture. Detection often works because the prompt itself looks adversarial, with obvious override language or obfuscation. The important point is that the model is being pressured to violate its own training constraints, which means the primary control problem sits at the model layer.

Practical implication: evaluate model-level hardening and adversarial detection separately from application guardrails.

Prompt injection exploits instruction concatenation and trust boundaries

Prompt injection works when an application concatenates trusted system instructions with untrusted text from users, files, emails, or web pages. The model then receives a single text stream and cannot reliably distinguish developer instructions from hostile instructions hidden inside the content. Indirect prompt injection is especially dangerous because the attacker never has to type the malicious instruction directly into the chat box. The payload can live in a document or message that the application was already designed to process, which makes the vulnerability architectural rather than linguistic.

Practical implication: isolate untrusted inputs, constrain tool use, and validate outputs before they can drive actions.

Why signatures catch jailbreaks but miss many prompt injections

Jailbreaks often leave adversarial markers because the attacker is fighting against model resistance. Prompt injection often does the opposite. A malicious instruction can be phrased to look like ordinary content for the application, such as a normal-looking email instruction that is only harmful in context. That means signature-based detection can be useful for refusal bypass attempts yet still miss attacks that inherit legitimacy from the surrounding workflow. In security terms, the signal lives in the provenance and placement of the instruction, not just in its wording.

Practical implication: pair prompt classifiers with content provenance checks and action-level authorization.

NHI Mgmt Group analysis

Jailbreaks and prompt injection are different control problems, not a single AI risk bucket. Jailbreaks attack safety tuning at the model layer, while prompt injection attacks the application layer where trusted and untrusted strings are merged. Security teams that collapse them into one label end up selecting the wrong control family and measuring the wrong failure mode. The practitioner conclusion is simple: threat classification must drive control selection, not vendor marketing terms.

Prompt injection exposes an identity governance problem as much as a content problem. Once an AI application can read data and act on it, it becomes an NHI-like execution path that needs explicit trust separation, privilege boundaries, and action scoping. The issue is not only whether the prompt is malicious, but whether the application is allowed to treat untrusted material as instruction-bearing. Practitioners should treat instruction provenance as part of the identity control plane.

Trust-boundary collapse: the application assumes it can safely concatenate policy text and external content because the model will understand provenance. That assumption fails when the actor can only see a single prompt stream and has no intrinsic notion of source trust. The implication is that teams must rethink how instruction authority is represented in AI workflows, because provenance cannot be recovered after concatenation. The practitioner conclusion is to design for separation before inference, not detection after the fact.

Signature-based detection is necessary but structurally incomplete for AI security. It can catch adversarial jailbreak language because the attacker often has to fight the model’s resistance. It cannot reliably catch indirect prompt injection when the payload is contextually normal and only dangerous because of where it appears. The field needs layered controls that combine model hardening with application-level trust enforcement. The practitioner conclusion is that a single prompt filter is not an AI security strategy.

The real boundary is not between safe and unsafe text, but between trusted and untrusted instruction sources. That boundary becomes visible only when teams model how an AI system ingests data, invokes tools, and returns actions. Once that path is understood, the governance question changes from “Can we block bad prompts?” to “Which sources are allowed to issue instructions at all?” The practitioner conclusion is to govern instruction authority like any other privileged control plane.

From our research:
96% of technology professionals identify AI agents as a growing security threat, and 66% believe this risk is immediate, according to AI Agents: The New Attack Surface report.
80% of organisations report their AI agents have already performed actions beyond their intended scope, including accessing unauthorised systems, inappropriately sharing sensitive data, and revealing access credentials.
The governance response is developing fast in OWASP NHI Top 10, where instruction hijack and tool misuse are treated as separate risk surfaces.

What this signals

Prompt injection is an instruction-governance problem, not just a content-security problem. As AI applications begin to act on behalf of teams, provenance and trust boundaries must be treated as part of the identity plane. That shift becomes visible in our research on agent behaviour, where 33% of organisations report AI agents accessing inappropriate or sensitive data beyond their intended scope, according to the AI Agents: The New Attack Surface report.

Security programmes should expect AI control design to split into two tracks: model-layer assurance and application-layer authorisation. That split maps closely to the attack mechanics described here and aligns with the control logic in OWASP Agentic AI Top 10 and MITRE ATLAS adversarial AI threat matrix.

Instruction provenance gap: the operative question for practitioners is which content sources are allowed to influence actions, not whether the model can detect every malicious phrase. Teams that cannot answer that question should assume their AI workflow is already blending trusted policy with untrusted input.

For practitioners

Separate model risk from application risk Assess jailbreak exposure with model hardening and red teaming, then assess prompt injection with application architecture reviews. Treat them as different failure classes with different owners and different control objectives.
Classify trusted and untrusted instruction sources Map every place where external content can be concatenated into prompts, including documents, emails, webpages, and support tickets. Mark which of those sources may influence tool use, data access, or outbound communication.
Constrain tool execution after untrusted input Block high-risk actions unless the request is revalidated outside the prompt stream. Limit read, write, and send capabilities so untrusted content cannot directly trigger sensitive side effects.
Add provenance checks to AI workflows Require the system to preserve source context for ingested content and to log when that content affects downstream actions. Use those records to distinguish legitimate instruction paths from injected ones.

Key takeaways

Jailbreaks and prompt injection target different layers, so they require different defenses and different owners.
Signature-based detection helps with adversarial prompts, but it does not reliably stop indirect prompt injection through trusted workflows.
AI governance needs instruction provenance, trust separation, and action scoping, not just model alignment and content filtering.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Agentic AI Top 10 and MITRE ATLAS address the attack and risk surface, while NIST CSF 2.0 set the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
OWASP Agentic AI Top 10	LLM01	Covers jailbreaks and prompt injection as distinct agentic AI threats.
MITRE ATLAS	AML.T0054	Captures AI prompt manipulation and adversarial input techniques.
NIST CSF 2.0	PR.AC-4	Access control needs to limit what AI workflows can do after untrusted input arrives.

Map prompt handling and tool permissions to LLM01 and test both direct and indirect injection paths.

Key terms

Jailbreak: A jailbreak is an attempt to make a language model ignore its safety tuning and produce behaviour it was trained to refuse. The attack targets the model’s refusal layer, usually through role-play, override language, or other framing that changes how the model interprets the request.
Prompt Injection: Prompt injection is an attack that inserts malicious instructions into content an AI application processes as if it were trustworthy. It exploits the application’s trust boundary, especially when external text is merged with developer instructions before the model acts on them.
Indirect Prompt Injection: Indirect prompt injection hides hostile instructions inside external content such as documents, emails, or webpages that the application later ingests. The attacker does not need to type the prompt directly, which makes the attack harder to detect with simple keyword or signature controls.
Instruction Provenance: Instruction provenance is the ability to tell where an instruction came from and whether it is trusted. In AI governance, it matters because models cannot reliably separate developer policy from untrusted content once everything is concatenated into one prompt stream.

Deepen your knowledge

NHI governance, agentic AI identity, and machine identity security are core topics in our NHI Foundation Level course, the industry's only accredited NHI security programme. If you are responsible for identity security strategy or NHI governance in your organisation, it is worth exploring.

This post draws on content published by Pillar Security: The terminology problem causing security teams real risks. Read the original.

NHIMG Editorial Note
Published by the NHIMG editorial team on 2025-12-17.
NHI Mgmt Group — the independent authority on Non-Human Identity, IAM, and Agentic AI security. nhimg.org