Security teams should separate untrusted data from executable instructions, enforce runtime policy checks before tool use, and monitor outbound destinations for abuse. Prompt filtering alone is not enough because indirect prompt injection often arrives through trusted business data. The control goal is to stop the agent from treating attacker-controlled content as authority.
Why This Matters for Security Teams
Prompt injection is not just a content-filtering problem. In agent workflows, a malicious instruction can arrive inside a ticket, email, document, web page, or chat thread that the agent already trusts enough to process. Once the agent has tool access, that injected text can be converted into action: data exposure, unauthorised lookups, workflow manipulation, or outbound exfiltration. Guidance from both the OWASP Agentic AI Top 10 and the NIST AI Risk Management Framework points toward runtime controls rather than static trust assumptions.
NHIMG research shows why this matters operationally: in the AI Agents: The New Attack Surface report, 80% of organisations said their AI agents had already acted beyond intended scope, including unauthorised system access, sensitive data sharing, and credential disclosure. That is the real risk: prompt injection becomes a privilege problem the moment an agent can execute commands. In practice, many security teams encounter this only after an agent has already followed attacker-controlled text as if it were a legitimate business instruction.
How It Works in Practice
Defence starts by separating untrusted content from executable instructions. The agent should never parse business data, web content, or user-provided text as if it were policy. Instead, treat those inputs as inert context, and apply a strict instruction hierarchy so that only signed system prompts, approved templates, and runtime policy decisions can authorise action. This is consistent with guidance in the CSA MAESTRO agentic AI threat modeling framework and the OWASP NHI Top 10, which both emphasise controlling how identity and authority are consumed at runtime.
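The separation described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `Instruction` and `UntrustedContext` types, the `build_model_input` helper, and the hard-coded `SIGNING_KEY` are all hypothetical names introduced here, and HMAC-SHA256 is assumed as the signing scheme.

```python
import hashlib
import hmac
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical key for illustration only; a real deployment would load it
# from a secrets manager and never embed it in code.
SIGNING_KEY = b"demo-key-not-for-production"


def sign(text: str) -> str:
    """Return a hex HMAC-SHA256 tag for an approved instruction."""
    return hmac.new(SIGNING_KEY, text.encode("utf-8"), hashlib.sha256).hexdigest()


@dataclass(frozen=True)
class Instruction:
    text: str
    signature: str


@dataclass(frozen=True)
class UntrustedContext:
    source: str  # e.g. "ticket", "email", "web"
    text: str    # carried as data, never promoted to an instruction


def is_authoritative(instruction: Instruction) -> bool:
    """Only instructions carrying a valid signature may authorise action."""
    expected = sign(instruction.text)
    return hmac.compare_digest(expected, instruction.signature)


def build_model_input(instruction: Instruction,
                      context: List[UntrustedContext]) -> Dict[str, object]:
    """Keep signed instructions and untrusted content in separate fields."""
    if not is_authoritative(instruction):
        raise PermissionError("unsigned or tampered instruction rejected")
    return {
        "instructions": instruction.text,
        "data": [{"source": c.source, "content": c.text} for c in context],
    }
```

Keeping the two channels structurally separate does not by itself stop a model from following injected text, which is why the article pairs it with runtime policy checks on every tool call.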
Security teams should pair that with policy checks before every tool invocation. The useful question is not “Did the prompt look clean?” but “Should this agent do this task, with this data, at this moment?” That means:
- Enforcing allowlisted tools and destinations, with per-action approval logic for high-risk operations.
- Using intent-based authorisation so the agent’s goal, data sensitivity, and execution context are evaluated together.
- Issuing short-lived, JIT credentials and ephemeral secrets per task, then revoking them immediately after use.
- Logging the prompt, tool decision, target system, and outbound destination so abuse can be investigated.
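A pre-invocation check along these lines might look like the following sketch. The tool names, host names, sensitivity labels, and the `ToolRequest`, `decide`, and `authorise` names are assumptions made up for this example; a production system would normally delegate the decision to a dedicated policy engine rather than inline Python.

```python
import json
import logging
from dataclasses import dataclass
from typing import Optional, Tuple
from urllib.parse import urlparse

logger = logging.getLogger("agent.audit")

# Hypothetical allowlists for illustration.
ALLOWED_TOOLS = {"search_kb", "http_get", "send_email"}
HIGH_RISK_TOOLS = {"send_email"}
ALLOWED_HOSTS = {"kb.example.internal", "api.example.com"}


@dataclass(frozen=True)
class ToolRequest:
    agent_id: str
    tool: str
    destination: Optional[str]  # full URL, or None for local tools
    data_sensitivity: str       # "public", "internal", or "restricted"
    prompt: str = ""
    human_approved: bool = False


def decide(request: ToolRequest) -> Tuple[bool, str]:
    """Evaluate tool, destination, approval, and data sensitivity together."""
    if request.tool not in ALLOWED_TOOLS:
        return False, "tool not allowlisted"
    if request.destination is not None:
        host = urlparse(request.destination).hostname
        if host not in ALLOWED_HOSTS:
            return False, "destination not allowlisted"
    if request.tool in HIGH_RISK_TOOLS and not request.human_approved:
        return False, "high-risk tool requires approval"
    if request.data_sensitivity == "restricted" and request.destination is not None:
        return False, "restricted data may not be sent outbound"
    return True, "allowed"


def authorise(request: ToolRequest) -> bool:
    """Run the policy check and write an audit record for every decision."""
    allowed, reason = decide(request)
    logger.info(json.dumps({
        "agent_id": request.agent_id,
        "prompt": request.prompt,
        "tool": request.tool,
        "destination": request.destination,
        "allowed": allowed,
        "reason": reason,
    }))
    return allowed
```

The check is deny-by-default for anything outside the allowlists, and the audit record is written for denials as well as approvals so that blocked attempts remain visible to investigators.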
For implementations, runtime policy-as-code is the safer pattern. Teams often map this to OPA, Cedar, or similar decision engines, while using workload identity such as SPIFFE/SPIRE or OIDC claims to prove what the agent is, not just what secret it holds. This matters because static RBAC alone cannot keep pace with autonomous, goal-driven behaviour; the agent may chain tools in ways no pre-defined role model anticipated. These controls tend to break down when the agent is allowed to browse arbitrary external content and invoke broad side-effecting tools in the same transaction.
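One way to address the failure mode in the last sentence is to track, per transaction, whether the agent has ingested external content and to refuse side-effecting tools once it has. The sketch below assumes a hypothetical `Transaction` class and made-up tool and source names; it illustrates the idea only and is not a feature of OPA, Cedar, or SPIFFE/SPIRE.

```python
from dataclasses import dataclass, field
from typing import Set

# Hypothetical tool and source categories for illustration.
SIDE_EFFECTING_TOOLS = {"send_email", "update_record", "http_post"}
EXTERNAL_SOURCES = {"web", "email_attachment"}


@dataclass
class Transaction:
    """Tracks which untrusted sources have been read in this transaction."""
    tainted_by: Set[str] = field(default_factory=set)

    def ingest(self, source: str) -> None:
        """Record that content from `source` entered the agent's context."""
        if source in EXTERNAL_SOURCES:
            self.tainted_by.add(source)

    def may_invoke(self, tool: str) -> bool:
        """Deny side-effecting tools once external content has been ingested."""
        if tool in SIDE_EFFECTING_TOOLS and self.tainted_by:
            return False
        return True
```

Under this design an agent that needs both capabilities would split the work into two transactions, with a fresh policy decision between the browsing step and the side-effecting step.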
Common Variations and Edge Cases
Tighter runtime controls often increase latency and workflow friction, so organisations must balance containment against operational speed. That tradeoff becomes sharper in customer-facing agents, analyst copilots, and multi-agent systems where every extra approval can slow the business process. There is no universal standard for this yet, but best practice is evolving toward context-aware policy decisions rather than one-time prompt sanitisation.
Edge cases usually appear when the attacker hides instructions in trusted sources: a support ticket, a CRM note, a file attachment, or a retrieved web page. In those environments, “prompt filtering” gives a false sense of safety because the agent is still reading attacker-controlled material inside a legitimate workflow. The safer pattern is to classify data before retrieval, constrain what the agent can remember, and block tool use unless the request survives a fresh policy check. NHIMG’s Analysis of Claude Code Security and the Moltbook AI agent keys breach both reinforce the same lesson: once secrets and execution authority are exposed to the agent, prompt injection becomes a control-plane problem, not a text-analysis problem. For high-risk environments, combine this with the OWASP Top 10 for Agentic Applications 2026 and the NIST AI Risk Management Framework to formalise review, testing, and ownership.
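The classify-before-retrieval and bounded-memory ideas above can be sketched as follows. The classification labels, the `Document` type, and the `retrievable` helper are hypothetical; the example assumes documents are labelled before they reach the retrieval layer and treats anything with a missing or unknown label as out of reach.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List

# Hypothetical classification scale, lowest to highest sensitivity.
LEVELS = {"public": 0, "internal": 1, "restricted": 2}


@dataclass(frozen=True)
class Document:
    doc_id: str
    classification: str
    text: str


def retrievable(docs: List[Document], max_level: str) -> List[Document]:
    """Return only documents at or below the agent's classification ceiling.

    Documents with an unknown label are excluded rather than guessed at.
    """
    ceiling = LEVELS[max_level]
    return [
        d for d in docs
        if LEVELS.get(d.classification, ceiling + 1) <= ceiling
    ]


def new_memory(max_items: int) -> Deque[str]:
    """Bounded memory: the oldest entries drop off once the limit is reached."""
    return deque(maxlen=max_items)
```

Filtering before retrieval means sensitive material never enters the agent's context, so an injected instruction has less to exfiltrate even if it is followed.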
Standards & Framework Alignment
This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.
OWASP Agentic AI Top 10 and CSA MAESTRO address the attack and risk surface, while NIST AI RMF sets the governance and control requirements practitioners need to meet.
| Framework | Control / Reference | Relevance |
|---|---|---|
| OWASP Agentic AI Top 10 | LLM-01 | Prompt injection is a core agentic AI abuse path requiring runtime guardrails. |
| CSA MAESTRO | AID-03 | MAESTRO addresses agent threat modeling and execution controls for autonomous workflows. |
| NIST AI RMF | | AI RMF governance supports controls for accountable, risk-based agent operation. |
Model agent paths, restrict tools, and evaluate each action against policy before execution.
Reviewed and updated by the NHIMG editorial team on May 31, 2026.
NHI Mgmt Group — the #1 independent authority on Non-Human Identity, IAM, and Agentic AI security. nhimg.org