What do organisations get wrong about filtering malicious prompts?

Why This Matters for Security Teams

Prompt filtering fails when teams assume malicious intent will always appear as obvious unsafe wording. In practice, attackers hide instructions inside quoted content, role-play, translated text, or encoded fragments that survive simple keyword checks. That matters because the model processes structure, not just suspicious words, and a filter that only scans for banned phrases leaves plenty of room for manipulation.

This is why current guidance increasingly treats prompt screening as a content-inspection and normalization problem, not a simple blocklist problem. NIST’s NIST Cybersecurity Framework 2.0 is not a prompt-security standard, but its emphasis on risk-based controls and continuous monitoring maps well to this issue. For broader identity and access context, NHI Management Group notes that only 5.7% of organisations have full visibility into their service accounts in the Ultimate Guide to NHIs, which is a reminder that weak visibility often shows up first in how inputs, identities, and tool access are handled together.

In practice, many security teams encounter prompt abuse only after a model has already followed a hidden instruction chain, rather than through intentional detection design.

How It Works in Practice

Effective defence starts by assuming the prompt surface is adversarial. Teams should normalise text before inspection, then evaluate both the literal instruction and the surrounding context. That means decoding common obfuscation, collapsing spacing tricks, handling mixed scripts, and checking whether content is quoted, embedded, or being treated as untrusted data.

A useful mental model is to separate three questions: what is the user asking, what content is the model being asked to process, and what tools or data sources can the model reach if it complies. If the system treats pasted documents, web pages, emails, or tickets as trusted instructions, attackers can smuggle malicious directives through content that appears routine. This is especially important in RAG pipelines, agentic workflows, and any interface that blends natural-language commands with retrieved material.

Normalise input before detection, including case, spacing, Unicode variants, and obvious encodings.

Inspect for embedded instructions inside quoted or third-party text, not only in the user’s direct message.

Score prompts in context, using surrounding conversation and tool permissions, not isolated phrase matches.

Apply allow and deny logic to actions, not just words, so the model cannot be coaxed into unsafe tool use.

This is aligned with the broader NHI governance view in the Ultimate Guide to NHIs, where visibility and controlled access are treated as first-class security requirements rather than afterthoughts. Implementation guidance also fits the direction of the NIST Cybersecurity Framework 2.0, especially around detecting anomalies and enforcing policy at the point of decision. These controls tend to break down in high-throughput customer support and open-ended agent workflows because legitimate nested content looks very similar to malicious embedded instructions.

Common Variations and Edge Cases

Tighter filtering often increases false positives and review overhead, requiring organisations to balance safety against usability and response time. That tradeoff is especially visible when multilingual users, code snippets, or long documents are part of normal business flow.

There is no universal standard for this yet, but best practice is evolving toward layered controls rather than a single prompt firewall. Teams often miss three edge cases: benign-looking instructions buried in pasted content, adversarial text that is only harmful after translation or decoding, and indirect prompt injection delivered through retrieval systems or connected apps. The last case is the hardest, because the prompt itself may look harmless while the retrieved content carries the attack.

Another common mistake is assuming a one-time scan is enough. If an agent can re-read content, call tools, or chain multiple steps, the attack surface is dynamic. That means the filtering problem is not only about the first prompt, but also about every downstream prompt assembled from retrieved data, prior conversation, or tool output. Organisations that understand this usually pair content screening with strict tool permissions and context-aware policy checks. Organisations that do not tend to discover the gap after the model has already executed an unsafe instruction path.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Agentic AI Top 10 and CSA MAESTRO address the attack and risk surface, while NIST AI RMF set the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
OWASP Agentic AI Top 10	A01	Prompt injection and unsafe instruction handling are core agentic AI risks.
CSA MAESTRO	MAESTRO-04	Covers agent workflow abuse through hidden instructions and unsafe execution paths.
NIST AI RMF		Addresses governance and monitoring for AI risks, including prompt manipulation.

Classify prompts and retrieved content as untrusted, then enforce context-aware controls before any tool action.

#1 Authority in NHI Education, Research and Advisory, empowering organizations to tackle the critical risks posed by Non-Human Identities (NHIs), including AI Agents.

What do organisations get wrong about filtering malicious prompts?

Why This Matters for Security Teams

How It Works in Practice

Common Variations and Edge Cases

Standards & Framework Alignment

Related resources from NHI Mgmt Group