Subscribe to the Non-Human & AI Identity Journal

Why do prompt filters fail against indirect prompt injection?

Indirect injection hides malicious instructions inside content the agent is meant to process, such as documents, web pages, or emails. The model then sees both data and instruction in the same input stream and cannot reliably tell which is which. That is why the failure is structural, not just a matter of bad prompting.

Why This Matters for Security Teams

Prompt filters are built to catch suspicious text, but indirect prompt injection does not rely on obvious attack language. It hides instructions inside the very sources an agent is supposed to trust, such as web pages, tickets, PDFs, or emails. That means the control is trying to distinguish “data” from “instructions” after both have already been merged into the same context window, which is why the failure is structural.

For agentic systems, that matters because the prompt is not the control plane. The real risk is that an agent can treat untrusted content as executable guidance and then chain tools, forward secrets, or take actions the operator never intended. This is why current guidance from the OWASP Agentic AI Top 10 and NHIMG research on the OWASP Agentic Applications Top 10 treats indirect injection as an execution-risk problem, not a wording problem. In practice, many security teams encounter it only after an agent has already followed hostile instructions embedded in content it was explicitly asked to process.

How It Works in Practice

Indirect prompt injection succeeds because the model receives content and instructions in one stream and makes a probabilistic judgment about relevance. A prompt filter can block known jailbreak phrases, but it cannot reliably prove that a sentence inside a document is “only data” when the agent has been designed to read, summarize, extract, or act on that document. The attack surface expands further when the agent has tool access, because the injected instruction can tell it to search, exfiltrate, or transform information outside the original task.

Mitigations work best when they move beyond text scanning and into runtime governance. That includes separating instruction sources from untrusted content, constraining tool invocation, and evaluating policy at request time rather than relying on static prompt rules. In current practice, teams combine content sanitisation with allowlisted tools, scoped retrieval, and explicit human approval for sensitive actions. The OWASP Agentic AI Top 10 and the NHI failure patterns highlighted in DeepSeek breach both reinforce that attackers exploit trust boundaries, not just prompt wording.

  • Classify all retrieved content as untrusted, even when it comes from approved sources.
  • Keep system instructions, retrieval text, and user directives logically separate.
  • Use policy checks before tool calls, not only before model output.
  • Limit the agent’s ability to pass secrets between tools or contexts.
  • Log which source content influenced a high-risk action for later review.

Where this guidance breaks down most often is in highly autonomous agents with broad retrieval and file-system access, because the more tools and memory they have, the easier it is for a single injected instruction to propagate across multiple steps.

Common Variations and Edge Cases

Tighter prompt filtering often increases false positives and operational overhead, requiring organisations to balance usability against risk reduction. That tradeoff becomes sharper in environments where agents process long documents, chat histories, or mixed-trust knowledge bases, because the same language patterns can be benign in one context and malicious in another.

There is no universal standard for this yet, but current guidance suggests that indirect injection must be treated differently across workflow types. A customer-support summariser may need content isolation and constrained output formatting, while a code-assist agent may need stronger tool gating and repository trust boundaries. Organisations also need to consider that some attacks are not overt instructions at all, but hidden formatting, role-play cues, or cross-document contamination that only emerges after retrieval. NHIMG research on OWASP Agentic Applications Top 10 shows why these failures are especially dangerous when agents have persistence and delegation. Security teams should also align implementation with the broader control logic in OWASP Agentic AI Top 10 and zero-trust thinking, because filters alone do not create trust boundaries.

In practice, prompt filters are a useful signal layer, but they do not solve a trust-boundary problem created by autonomous systems processing adversarial content.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Agentic AI Top 10 and CSA MAESTRO address the attack and risk surface, while NIST AI RMF set the governance and control requirements practitioners need to meet.

Framework Control / Reference Relevance
OWASP Agentic AI Top 10 A3 Directly addresses prompt injection and unsafe agent instruction following.
CSA MAESTRO T3 Covers trust boundaries and tool misuse in autonomous agent workflows.
NIST AI RMF Supports governing unpredictable model behaviour and contextual risk decisions.

Treat all retrieved content as untrusted and block tool use when instructions appear in data.