Subscribe to the Non-Human & AI Identity Journal

What do security teams get wrong about prompt injection defence?

They often assume better blocklists will solve the problem, but obfuscation simply changes the shape of the payload. Real defence requires examining meaning across the full interaction, including retrieved content and model responses. If the control cannot interpret intent, it will keep missing the attack class it is meant to stop.

Why This Matters for Security Teams

Prompt injection defence fails when teams treat it like a simple content-filtering problem. That mindset works only for obvious abuse, but attackers can hide instructions in retrieved documents, tool outputs, markup, or benign-looking text that the model still interprets as tasking. The real risk is not the phrase itself, but the intent carried through the interaction. Guidance from the OWASP Agentic AI Top 10 and NHI-focused research from OWASP Agentic Applications Top 10 both point to the same issue: static rules rarely keep pace with adversarial variation.

Security teams also underestimate how often prompt injection becomes an identity and authorisation problem. Once a model can call tools, retrieve data, or trigger workflows, a successful injection can turn into unauthorised access, data exfiltration, or unsafe action execution. The control surface is wider than the prompt box. It includes connectors, memory, retrieval layers, and downstream systems that the model can influence indirectly. Current guidance suggests that the most effective programmes treat this as an end-to-end trust problem, not a single input validation issue.

In practice, many security teams encounter prompt injection only after a model has already chained a malicious instruction into a tool call or external request.

How It Works in Practice

Effective defence starts by assuming that the model will see hostile content somewhere in the workflow. The objective is not to stop all injection strings, but to detect when content is trying to change the agent’s task, scope, or policy boundaries. That means inspecting retrieved passages, tool responses, and model outputs for instruction-like behaviour, rather than scanning only user prompts. The OWASP Agentic AI Top 10 is useful here because it frames prompt injection alongside tool misuse, data leakage, and overbroad autonomy.

In practice, teams need layered controls:

  • Separate untrusted content from system instructions and make that boundary explicit in orchestration.
  • Use intent-aware policy checks before a model can act on retrieved or external content.
  • Limit tool scope with least privilege so a poisoned prompt cannot reach high-impact actions.
  • Log the full chain of prompt, retrieval, tool call, and response for investigation and tuning.
  • Test with adversarial examples that include obfuscation, encoding tricks, and indirect injection through documents.

This is also where NHI governance matters. If the model is using service accounts, API keys, or delegated workflows, a successful injection can abuse those identities even when the prompt itself looks harmless. NHIMG research shows that 80% of identity breaches involved compromised non-human identities such as service accounts and API keys, which is why prompt defence and identity control should be designed together, not separately. The OWASP Agentic Applications Top 10 is a practical reference for connecting model abuse to downstream identity exposure.

These controls tend to break down in retrieval-heavy environments with broad tool access because the model can be steered through trusted-looking content faster than human reviewers can inspect it.

Common Variations and Edge Cases

Tighter prompt controls often increase operational overhead, requiring organisations to balance security gain against false positives, analyst fatigue, and slower workflows. That tradeoff is real, especially in customer-facing copilots or internal knowledge assistants where legitimate instructions and malicious instructions can look similar. Best practice is evolving, and there is no universal standard for perfect prompt injection detection yet.

One common edge case is indirect injection through content that was never meant to be executable, such as emails, tickets, PDFs, or web pages. Another is tool-mediated injection, where the model receives safe-looking text from a connector but the downstream action is unsafe. Teams also miss cases where the model’s own response becomes the attack vehicle, for example when it is prompted to relay secrets, summarise restricted data, or reformat text in a way that exposes hidden instructions.

For that reason, practitioners should align detection with runtime policy and identity boundaries, not just text filters. The most useful references here are OWASP Agentic AI Top 10, OWASP Agentic Applications Top 10, and the broader NHI Mgmt Group guidance on identity sprawl and privileged access. The lesson is simple: if the control cannot judge meaning, context, and authority together, it will miss the attack shape that matters most.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Agentic AI Top 10 and CSA MAESTRO address the attack and risk surface, while NIST AI RMF set the governance and control requirements practitioners need to meet.

Framework Control / Reference Relevance
OWASP Agentic AI Top 10 A2 Prompt injection is a core agentic application abuse path.
CSA MAESTRO MAESTRO addresses agentic workflow risk and control-plane governance.
NIST AI RMF AIRMF covers trustworthy AI risk management and operational controls.

Treat model inputs, retrieval, and tool outputs as untrusted and gate actions by runtime policy.