What do security teams get wrong about prompt filtering for AI agents?

Why Prompt Filtering Misses the Real Risk

Security teams often assume prompt filtering is the front line for AI agents, but the real issue is the agent’s autonomous, goal-driven behaviour. A filter can reject obvious jailbreak text and still miss a malicious instruction embedded in a document, ticket, or knowledge base that the agent is already allowed to read. That is why the OWASP NHI Top 10 and OWASP Agentic AI Top 10 focus on agentic abuse paths, not just malicious text strings.

The control gap is also visible in field data. SailPoint reports that 80% of organisations say their AI agents have already acted beyond intended scope, including unauthorised access, sensitive data sharing, or credential exposure. That is a governance problem, not a prompt hygiene problem, and it is exactly why prompt filtering cannot be treated as a complete control layer. Current guidance from NIST AI Risk Management Framework points toward broader lifecycle controls, while the AI Agents: The New Attack Surface report shows why visibility into agent actions matters more than text inspection alone.

In practice, many security teams discover prompt filtering failed only after an agent has already followed a hidden instruction through an approved connector and valid credentials.

How It Works in Practice

Effective defence starts by treating the agent as a workload with identity, scope, and runtime policy, not as a chat interface. Prompt filtering still has value as a first screen, but it must sit alongside intent-based authorisation, JIT credential issuance, and short-lived secrets. An agent should receive only the minimum access needed for the current task, and that access should expire automatically when the task completes. For autonomous systems, static RBAC is too coarse because the exact sequence of actions is not fixed in advance.

Practitioners increasingly pair workload identity with policy evaluation at request time. That means the platform checks what the agent is trying to do, which connector it wants to use, what data it is asking for, and whether that action matches the approved goal. This is where CSA MAESTRO agentic AI threat modeling framework and NIST AI Risk Management Framework are useful: they push teams toward continuous governance rather than one-time approval. For implementation detail, teams can also study Analysis of Claude Code Security alongside the OWASP Agentic Applications Top 10.

Use workload identity for the agent, not shared service credentials.

Issue JIT credentials per task with tight TTL and automatic revocation.

Validate connector use, data scope, and tool invocation at runtime.

Log every action so prompt injection attempts can be correlated with execution.

These controls tend to break down when an agent can chain multiple tools across loosely governed SaaS connectors because the policy boundary becomes fragmented.

Common Variations and Edge Cases

Tighter filtering often increases friction, requiring organisations to balance user experience against containment. That tradeoff matters most in environments with long-running agents, human-in-the-loop approvals, or agents that must read untrusted content as part of their job. In those cases, there is no universal standard for exactly how much prompt content should be filtered versus sandboxed, so best practice is evolving toward layered controls rather than single-point prevention.

One common edge case is indirect prompt injection inside legitimate enterprise content. Another is secret exposure, where the agent is not tricked by wording at all but is instead abused because a connector or token is overly permissive. The AI LLM hijack breach and DeepSeek breach illustrate how quickly exposed secrets and broad access become operational risk, especially when the agent can act faster than a human reviewer. For broader threat mapping, MITRE ATLAS adversarial AI threat matrix helps teams separate model abuse from access abuse, which are often conflated in incident reviews.

The practical answer is not “filter harder.” It is to reduce standing privilege, constrain tool access, and assume the agent will eventually encounter hostile content. In environments with highly dynamic workflows, that means prompt filtering remains helpful, but only as one input to a broader zero trust control set.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Agentic AI Top 10 and CSA MAESTRO address the attack and risk surface, while NIST AI RMF set the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
OWASP Agentic AI Top 10	A01	Prompt filtering alone misses agentic abuse and tool-chain exploitation.
CSA MAESTRO	T1	MAESTRO models agent workflows, connectors, and runtime trust decisions.
NIST AI RMF		AI RMF addresses governance gaps beyond text filtering.

Treat agent prompts, tools, and actions as one attack surface and enforce runtime controls.

#1 Authority in NHI Education, Research and Advisory, empowering organizations to tackle the critical risks posed by Non-Human Identities (NHIs), including AI Agents.

What do security teams get wrong about prompt filtering for AI agents?

Why Prompt Filtering Misses the Real Risk

How It Works in Practice

Common Variations and Edge Cases

Standards & Framework Alignment

Related resources from NHI Mgmt Group