Subscribe to the Non-Human & AI Identity Journal

What breaks when prompt injection is handled only inside the model layer?

The control breaks because the malicious input has already entered the session before the model tries to interpret it. By that point, the system may already have exposed context, selected unsafe tools, or generated a harmful response. Effective defence needs inspection at the edge, plus identity and policy enforcement around tool use and data access.

Why This Matters for Security Teams

Prompt injection is not just a model-quality problem. If the only control sits inside the model layer, malicious content can already influence retrieval, tool selection, memory, or downstream actions before the model has a chance to “notice” the attack. That is why guidance from the OWASP Agentic AI Top 10 treats agentic abuse as a system security issue, not a prompt-filtering issue. The risk expands when the model can call tools, access secrets, or act on behalf of a user or service account.

Security teams often underestimate how fast a poisoned prompt can become a privileged action. Once the model has access to context windows, connectors, or execution APIs, the attack surface is no longer confined to text generation. NHI Management Group research shows how often secrets and non-human identities are exposed in real environments, including the finding that 96% of organisations store secrets outside secrets managers in vulnerable locations such as code, config files, and CI/CD tools, from the Ultimate Guide to NHIs. In practice, many security teams discover prompt injection only after a tool call or data exfiltration has already occurred, rather than through intentional model-layer inspection.

How It Works in Practice

Effective defence starts before the model sees the content and continues after the model responds. That means input inspection at the edge, retrieval filtering, strict tool mediation, and runtime policy checks around identity and data access. Model-layer classifiers can help, but they should be treated as one signal among many, not the primary control.

A practical control stack usually includes:

  • Edge filtering for untrusted instructions, including content from users, documents, web pages, and retrieved records.
  • Context partitioning so the model cannot freely blend user input with system prompts, secrets, or high-risk instructions.
  • Tool allowlisting with explicit authorization for each call, especially when the agent can send email, run code, or query internal systems.
  • Workload identity and short-lived credentials so the agent proves what it is and receives only task-scoped access.
  • Policy evaluation at request time, using policy-as-code instead of static prompt rules.

This aligns with the intent of the OWASP Agentic AI Top 10 and the operational direction in the OWASP Agentic Applications Top 10. The key lesson is that the model should never be the sole gatekeeper for tool use or data exposure. Once prompt injection can influence retrieval pipelines, browser tools, or connected secrets stores, a model-only control becomes too late because the environment has already been partially compromised.

Common Variations and Edge Cases

Tighter filtering often increases false positives and operational overhead, so organisations need to balance blocking hostile instructions against preserving legitimate workflows. Best practice is evolving for agentic systems that combine RAG, external tools, and autonomous planning, because there is no universal standard for prompt-injection handling yet.

Edge cases are especially difficult when the malicious payload is embedded in trusted-looking data, such as a support ticket, email thread, PDF, or web page the agent is asked to summarise. In those environments, model-only detection breaks down because the content may appear semantically valid while still carrying adversarial instructions. The same problem appears when the agent chains multiple tools: a seemingly harmless prompt can trigger search, retrieval, and then a privileged write action.

That is why current guidance suggests treating prompt injection as a control-plane issue across identity, policy, and data boundaries, not as a content moderation problem alone. Teams should also review how secrets are handled in adjacent systems, since exposed credentials make successful injection far more damaging than a text-only error. For a real-world example of how token exposure can cascade across development tooling, see the JetBrains GitHub plugin token exposure. These controls tend to break down when an agent has persistent memory plus direct access to internal tools, because a single injected instruction can survive across turns and execute outside the original prompt boundary.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Agentic AI Top 10 and CSA MAESTRO address the attack and risk surface, while NIST AI RMF set the governance and control requirements practitioners need to meet.

Framework Control / Reference Relevance
OWASP Agentic AI Top 10 A3 Prompt injection is a core agentic application abuse pattern.
CSA MAESTRO TRUST-02 MAESTRO emphasizes trust boundaries across agent planning and tool use.
NIST AI RMF GOVERN AI RMF governance is needed to assign accountability for injection risks.

Inspect inputs outside the model and restrict agent actions with layered runtime controls.