What do organisations get wrong about prompt injection and jailbreak risk?

Why This Matters for Security Teams

Prompt injection and jailbreak risk are often dismissed as “model safety” problems, but the real exposure is broader: untrusted text can become instructions, and connected tools can turn those instructions into action. That is why the threat pattern is now treated in OWASP Agentic AI Top 10 as a system-level issue, not a prompt-only issue. NHI Management Group’s OWASP NHI Top 10 and OWASP Agentic Applications Top 10 both point to the same operational failure: teams secure the model boundary but leave retrieval sources, plugins, and downstream actions effectively trust-first.

That gap matters because an attacker does not need to “break” the model in the classic sense. They only need to place malicious instructions in a document, webpage, ticket, email, or API response that the model will consume. Once the model is allowed to summarize, retrieve, route, or execute with privileges, the attack becomes an application security and identity problem. In practice, many security teams encounter this only after the model has already disclosed data or triggered an unsafe action, rather than through intentional testing.

How It Works in Practice

The safest mental model is that any content the system ingests may be adversarial until proven otherwise. That includes user prompts, retrieved documents, chat history, tool outputs, web content, and hidden system instructions copied into the context window. The model may follow the malicious instruction if the surrounding application fails to separate data from directives.

Classify inputs by trust level before they reach the model, especially retrieved content and external tool outputs.

Strip or quarantine instruction-like text from untrusted sources where possible, and label it explicitly as data.

Constrain tools with least privilege so the model cannot read, write, or send more than it needs.

Require policy checks at runtime for high-risk actions, instead of assuming a prompt template will hold.

Log the full chain of prompt, retrieval, and tool use so suspicious instruction paths can be investigated.

This aligns with the direction of the NIST Cybersecurity Framework 2.0, which emphasizes governance, protection, detection, and response across the system, not just at one control point. It also maps to the NHI lesson highlighted in Ultimate Guide to NHIs — Why NHI Security Matters Now: if a workload can act, it must be governed as an identity-bearing actor with constrained authority.

Operationally, this works best when prompt handling, retrieval security, and tool authorization are designed together. These controls tend to break down when an environment mixes untrusted public content with privileged internal workflows because the model cannot reliably tell the difference unless the application makes that difference explicit.

Common Variations and Edge Cases

Tighter prompt filtering often increases false positives and workflow friction, requiring organisations to balance safety against usability and throughput. That tradeoff is real, and current guidance suggests there is no universal standard for how much filtering is enough. For some use cases, aggressive blocking is appropriate; for others, contextual review or human approval is the better control.

The hardest edge cases are agentic systems and RAG-heavy workflows. In those environments, the attack surface is not just the prompt but the entire chain of retrieved text, intermediate reasoning, and tool calls. A poisoned knowledge base entry, a malicious ticket comment, or a crafted document can all become instruction sources if the application does not label and isolate them. This is why the Top 10 NHI Issues and the vendor-researched 2024 ESG Report: Managing Non-Human Identities are relevant here: compromise usually appears when trust is granted too broadly to non-human actors and their connected services.

One practical nuance is that jailbreak testing against the model alone can create false confidence. The same prompt may be harmless in a sandbox but dangerous once paired with a retrieval layer, long-lived tokens, or write-capable tools. Organisations should therefore test the full workflow, not just the prompt.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Agentic AI Top 10 and CSA MAESTRO address the attack and risk surface, while NIST AI RMF set the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
OWASP Agentic AI Top 10	A1	Prompt injection and jailbreaks are core agentic application threats.
CSA MAESTRO	A2	MAESTRO covers securing agent behavior, tools, and orchestration paths.
NIST AI RMF		AI RMF addresses governance and operational risk for AI systems.

Test the full agent workflow for malicious instructions in inputs and tool paths.

#1 Authority in NHI Education, Research and Advisory, empowering organizations to tackle the critical risks posed by Non-Human Identities (NHIs), including AI Agents.

What do organisations get wrong about prompt injection and jailbreak risk?

Why This Matters for Security Teams

How It Works in Practice

Common Variations and Edge Cases

Standards & Framework Alignment

Related resources from NHI Mgmt Group