Why do safety filters not guarantee AI agent security?

Why This Matters for Security Teams

Safety filters are designed to block disallowed content, not to prove that an agent will behave safely under pressure. That distinction matters because an AI agent can be technically “safe” in a chat sense and still be dangerous if it can be nudged into opening files, calling tools, forwarding data, or chaining actions from untrusted input. Current guidance from the OWASP Agentic AI Top 10 and the NIST AI Risk Management Framework treats this as a control problem, not a prompt-style moderation problem.

For NHI security, the issue is that agents operate with execution authority. A filter may stop harmful wording while still allowing hidden instructions embedded in a webpage, document, ticket, or email to steer the agent toward credential exposure or privilege misuse. That is why security teams need action-resistance testing, runtime authorization, and workload identity, not just content review. NHIMG has shown the same pattern in real-world research on agent and credential compromise, including the OWASP NHI Top 10 coverage and the Moltbook AI agent keys breach analysis. In practice, many security teams discover the gap only after an agent has already acted on untrusted context, rather than through intentional security testing.

How It Works in Practice

Safety filters usually inspect the generated output or the immediate prompt. Security for agents must inspect the whole action path: what context entered the model, what tools were available, what identity the agent used, and what policy allowed the action at runtime. That means treating the agent like a workload with its own identity, not like a user with a static role. The emerging pattern is context-aware authorization, short-lived credentials, and per-request policy evaluation.

In practice, teams often combine these controls:

Issue ephemeral credentials per task, with automatic revocation after completion.

Use workload identity, such as OIDC-backed service identity or SPIFFE/SPIRE-style proof of workload identity, to bind actions to a specific agent instance.

Evaluate policy at request time using policy-as-code rather than relying only on pre-defined RBAC.

Segment tools so the agent can only reach the minimum set required for the current objective.

Test against prompt injection, tool abuse, and indirect instruction smuggling, not only toxic output.

This matters because an agent can refuse unsafe language yet still comply with a malicious instruction that says, for example, “summarise this file and send it to the ticketing system,” if the runtime policy allows it. That is why the CSA MAESTRO agentic AI threat modeling framework and MITRE’s MITRE ATLAS adversarial AI threat matrix are more useful than content-only controls: they focus on attack paths, not just outputs. NHIMG research on the State of Non-Human Identity Security also shows that organisations still struggle with visibility and over-privilege, which becomes more dangerous when the workload can decide how to act. These controls tend to break down in tool-rich environments with long-lived tokens and broad connector access because the agent’s effective privilege becomes much larger than the filter’s scope.

Common Variations and Edge Cases

Tighter action controls often increase operational overhead, so organisations must balance safety against latency, developer friction, and false denials. There is no universal standard for this yet, and current guidance suggests different levels of control depending on whether the agent only drafts text, reads internal data, or can execute transactions.

Some edge cases are especially important. First, an agent with read-only access can still create security impact if it is allowed to summarise sensitive data into an uncontrolled destination. Second, safety filters may still help with abuse detection, but they should be treated as a secondary layer, not a security boundary. Third, indirect prompt injection from web pages, PDFs, issue trackers, and shared drives is often harder to stop than direct user prompts because the malicious instruction is embedded in ordinary content. Fourth, multi-agent workflows can amplify mistakes when one agent trusts another agent’s output as if it were verified input.

For mature programmes, the practical test is simple: if a hidden instruction can change what the agent is allowed to do, the control plane is still too weak. That is why NHI governance for agents must prioritise runtime policy, scoped credentials, and observable action logs over model moderation alone, especially in environments that connect to email, files, and production systems.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Agentic AI Top 10 and CSA MAESTRO address the attack and risk surface, while NIST AI RMF set the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
OWASP Agentic AI Top 10	A01	Covers prompt injection and action abuse that filters do not stop.
CSA MAESTRO	MT-3	Addresses agent threat modeling across tools, identity, and actions.
NIST AI RMF		Governance and risk mapping are needed beyond content moderation.

Test agent tool paths for indirect prompt injection and block unsafe actions at runtime.

#1 Authority in NHI Education, Research and Advisory, empowering organizations to tackle the critical risks posed by Non-Human Identities (NHIs), including AI Agents.

Why do safety filters not guarantee AI agent security?

Why This Matters for Security Teams

How It Works in Practice

Common Variations and Edge Cases

Standards & Framework Alignment

Related resources from NHI Mgmt Group