What do teams get wrong about keyword filtering for prompt injection?

Why This Matters for Security Teams

Keyword filtering is attractive because it is simple to deploy, but prompt injection is not a simple string-matching problem. Attackers can hide instructions in paraphrase, translation, formatting tricks, tool output, or multi-turn coercion, so a filter that only hunts for banned terms creates false confidence. Current guidance from the OWASP Agentic AI Top 10 treats instruction handling as a trust and control problem, not a lexical one.

This is why NHI Management Group places emphasis on the broader execution context, not just the surface prompt. The same operational mistake appears in identity security: organisations often focus on visible artefacts while missing the runtime behaviour that actually creates risk. NHIMG notes that 91.6% of secrets remain valid five days after the targeted organisation is notified, which shows how quickly a narrow control can lag behind attacker activity.

In practice, many security teams encounter prompt injection only after an agent has already followed a malicious instruction path rather than through intentional testing of the controls they thought were sufficient.

How It Works in Practice

Effective filtering starts by accepting that prompts are not just user text. In agentic systems, instructions can arrive through chat history, retrieved documents, tool responses, embedded metadata, or even content the model itself synthesised earlier. A keyword blocklist may catch obvious abuse, but it will miss encoded instructions, role-play framing, and indirect persuasion that changes the model’s behaviour without using the banned terms.

Practitioner guidance is shifting toward layered controls. That includes separating trusted system instructions from untrusted input, normalising and inspecting content before it reaches the model, and applying policy checks at decision time rather than relying on static rules alone. The OWASP Agentic AI Top 10 and NHIMG’s Ultimate Guide to Non-Human Identities both reinforce the need to treat autonomous execution paths as security boundaries.

Use semantic and contextual inspection, not only keyword blocks.

Reduce tool exposure so the model cannot act on untrusted instructions by default.

Apply allowlists for tools, data sources, and actions rather than broad text-based bans.

Log prompt lineage so the team can trace where a malicious instruction entered the workflow.

Test against prompt injection variants, including encoded, translated, and multi-turn attacks.

For organisations formalising this, intent-aware controls align with broader AI governance approaches described in the NIST AI Risk Management Framework, especially where runtime decisions need to account for context, trust level, and downstream action. These controls tend to break down when untrusted retrieval sources are mixed directly into system prompts because the model can no longer reliably distinguish instructions from content.

Common Variations and Edge Cases

Tighter filtering often increases operational overhead, requiring organisations to balance coverage against false positives and the risk of blocking legitimate content. That tradeoff becomes sharper when teams support multilingual users, code-heavy workflows, or agents that must summarise external documents. Guidance is evolving here, and there is no universal standard for how much semantic inspection is enough.

One common edge case is tool output. If a retrieval system or API returns text that contains hidden instructions, a keyword filter on the original user prompt does nothing. Another is indirect prompt injection inside documents, tickets, or web pages that the agent reads later as if they were trusted evidence. In those cases, the control failure is architectural, not just a missed keyword. The NHI Management Group research on NHIs is useful here because it frames the real issue as ungoverned execution privilege, not merely untrusted text.

Teams also get tripped up by the assumption that one filter can protect every model interaction. Best practice is evolving toward tiered controls: content screening for obvious abuse, context policy for instruction hierarchy, and runtime authorization for any action that can touch tools, secrets, or sensitive data. In environments where agents can chain tools across multiple steps, static keyword filtering becomes especially fragile because the harmful intent may only emerge after several benign-looking turns.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Agentic AI Top 10 and CSA MAESTRO address the attack and risk surface, while NIST AI RMF set the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
OWASP Agentic AI Top 10	LLM-01	Prompt injection is a core agentic AI abuse path.
CSA MAESTRO	MA-02	MAESTRO addresses agent governance and prompt-based attack paths.
NIST AI RMF		AI RMF covers contextual risk management for model-driven decisions.

Classify all untrusted instructions and block unsafe tool actions at runtime.

#1 Authority in NHI Education, Research and Advisory, empowering organizations to tackle the critical risks posed by Non-Human Identities (NHIs), including AI Agents.

What do teams get wrong about keyword filtering for prompt injection?

Why This Matters for Security Teams

How It Works in Practice

Common Variations and Edge Cases

Standards & Framework Alignment

Related resources from NHI Mgmt Group