Prompt injection exposes the trust model behind LLM applications

By NHI Mgmt Group Editorial TeamPublished 2026-03-17Domain: Agentic AI & NHIsSource: WorkOS

TL;DR: Prompt injection has become a top-tier LLM risk because models cannot reliably separate trusted instructions from untrusted text, and attacks now span direct, indirect, multimodal, and agentic tool abuse, according to WorkOS and the OWASP Top 10 for LLM Applications. The deeper issue is not just filtering malicious prompts, but governing systems that treat language itself as executable context.

At a glance

What this is: Prompt injection is a core LLM security flaw in which untrusted text can override intended instructions and drive unintended model or agent actions.

Why it matters: It matters because the same trust boundary failure affects NHI, autonomous agent, and human-facing workflows wherever LLMs can read external content and trigger tools.

By the numbers:

Prompt injection has been ranked the number one vulnerability on the OWASP Top 10 for LLM Applications since 2025.
Research has shown that as few as five strategically poisoned documents in a RAG knowledge base can manipulate AI responses 90% of the time.
Joint research from OpenAI, Anthropic, and Google DeepMind found that sophisticated attackers can bypass published defenses with over 90% success rates when given enough attempts.

👉 Read WorkOS's guide to prompt injection attacks and defences

Context

Prompt injection is an LLM-specific attack in which hostile text is treated as instructions instead of data. The primary keyword here is prompt injection, and the governance problem is the absence of a reliable boundary between system instructions, user content, and retrieved material.

For IAM and security teams, the issue is not limited to chatbots. Once an LLM can read documents, emails, web pages, or tool outputs, the same trust failure can reach NHI credentials, human workflows, and agentic actions in production.

The article's starting position is typical for teams adopting LLMs quickly: the model is connected to tools and external context before the organisation has defined a durable control boundary for untrusted input.

Key questions

Q: How should security teams handle prompt injection in LLM applications?

A: Security teams should treat prompt injection as an application and identity boundary problem, not just a content filtering problem. The practical response is layered control: isolate untrusted inputs, minimise tool permissions, gate sensitive actions, and log model behaviour for review. If the model can act, then the permissions attached to it define the real risk.

Q: Why do prompt injection attacks create so much risk for AI agents?

A: Prompt injection is risky for AI agents because the model can be steered into using tools and credentials that already exist in the workflow. Once action-taking is possible, the attack is no longer about a wrong answer. It becomes a path to data exposure, configuration change, or code execution through delegated access.

Q: What do organisations get wrong about defending against prompt injection?

A: The common mistake is relying on input filtering alone. Prompt injection can be hidden in documents, emails, images, and code comments, so rewording or obfuscation often defeats simple filters. Organisations also overestimate the model's ability to refuse malicious text and underestimate the value of runtime authorisation controls.

Q: How do you know if an LLM workflow is too privileged?

A: An LLM workflow is too privileged when a successful injection could reach systems or actions the user should not control in the first place. If the assistant can modify settings, access broad data, or trigger destructive API calls without explicit approval, the permission scope is too wide for safe operation.

Technical breakdown

Why prompt injection breaks the instruction boundary

Prompt injection works because current LLMs ingest system prompts, user input, and retrieved content as one context stream. They do not enforce a hard privilege boundary between instructions and data, so persuasive or hidden text can influence behaviour even when it should be treated as untrusted. This is why the problem is architectural rather than purely content-based. Filtering helps, but it cannot create a deterministic separation the model itself does not understand.

Practical implication: treat prompt boundaries as an application control problem, not a model-only problem.

Indirect and multimodal prompt injection in RAG pipelines

Indirect injection embeds malicious instructions inside content the model later consumes, such as emails, documents, or retrieved knowledge. Multimodal injection extends the same pattern into images, audio, and other inputs, where hidden payloads can evade text-only checks. In RAG systems, poisoned documents can persist as trusted context and repeatedly alter outputs. The risk rises when the model is allowed to act on what it reads.

Practical implication: isolate untrusted retrieval sources and tag external content before it reaches the model.

Agentic prompt injection and tool misuse

The highest-risk variant appears when an LLM has tool access. At that point, a successful injection is not just a bad answer, it is a path to unauthorized action through the credentials and permissions already attached to the agent or user session. The model may be tricked into modifying settings, calling APIs, or triggering code execution. In identity terms, the blast radius is defined by the permissions granted to the agent, not by the elegance of the prompt attack.

Practical implication: constrain tool permissions and require explicit approval for sensitive actions.

Threat narrative

Attacker objective: The attacker wants the model or agent to ignore intended instructions and carry out actions, disclosures, or code execution that the user did not approve.

Entry occurs when malicious instructions are placed directly into a prompt, hidden inside retrieved content, or embedded in a document, email, image, or code comment that the model will later ingest.
Credential access or abuse follows when the model is induced to use the tools, tokens, or delegated permissions attached to the session, turning a text attack into an authorised action path.
Impact occurs when the agent executes unintended operations such as data exfiltration, configuration changes, or remote code execution through its connected tools.

Moltbook AI agent keys breach — Moltbook breach exposed 1.5M AI agent keys.
AI LLM hijack breach — attackers used stolen AWS access keys to hijack Anthropic LLM models on Bedrock.

Read our 52 NHI Breaches Analysis report for a comprehensive view of breaches impacting Non-Human Identities including AI Agents.

NHI Mgmt Group analysis

Prompt injection is not just an LLM vulnerability, it is an identity boundary failure. The article shows that mixed-trust text becomes executable context once an LLM can act on it. That means the security problem is not only content safety, but whether the system can distinguish instruction sources before a tool call is made. Practitioners should stop treating the model as a neutral processor and start treating it as a policy-bearing execution point.

Context window pollution: Large context windows increase the volume of text that can compete with trusted instructions, which makes the model easier to steer through buried or repeated adversarial content. This is not a simple scanning problem because the attack surface is conversational, not token-based alone. The implication is that long-context architectures need governance assumptions that account for instruction dilution, not just larger buffers.

Least privilege is the decisive control plane for agentic LLM risk. The article's own examples show that once a model can call tools, the blast radius is set by credential scope, not prompt quality. Short-lived, narrowly scoped access and deterministic approval gates matter because they reduce what a successful injection can reach. Practitioners should evaluate every LLM integration as an identity-bound workflow, not a standalone AI feature.

Fine-grained authorization must sit between the model and the action. Prompt filtering cannot reliably distinguish benign from malicious intent across paraphrase, obfuscation, and multimodal payloads. That pushes control enforcement into the runtime authorisation layer, where actions can be checked against user and resource relationships before execution. The practical conclusion is that identity governance, not prompt tuning, is what limits cross-user and cross-resource damage.

OWASP NHI and agentic guidance now overlap materially. Once an LLM can initiate actions through tools, its permissions behave like a non-human identity with dynamic input exposure. The best governance model is no longer a separate AI bucket, but a shared control framework that covers NHI credentials, delegated actions, and human approval boundaries together. Teams should align security reviews across AI, IAM, and PAM rather than manage them as isolated programmes.

From our research:
91.6% of secrets remain valid five days after the targeted organisation is notified, showing a critical gap in remediation procedures, according to the Ultimate Guide to NHIs.
Only 5.7% of organisations have full visibility into their service accounts, which means most identity teams cannot reliably see every machine credential in circulation.
That visibility gap makes 52 NHI Breaches Analysis the next resource to review when prompt injection can pivot into NHI misuse and delegated action abuse.

What this signals

Prompt injection is becoming an identity operations problem as much as an AI security problem. When assistants can read external content and act on tools, the control question shifts from "is the prompt safe" to "what can this identity reach if it is steered." That is why teams should align AI runtime controls with NHI governance, PAM approvals, and scoped delegation policies.

Context poisoning is the named concept teams should watch. It describes the steady accumulation of untrusted text inside the model's working context until instructions are diluted or overridden. The governance implication is that retrieval quality, content tagging, and action gating have to be designed together, not as separate layers.

With 80% of identity breaches involving compromised non-human identities such as service accounts and API keys, per the Ultimate Guide to NHIs, the next step is to map LLM tool permissions into the same control inventory used for machine identities and other delegated actors.

For practitioners

Separate trusted instructions from untrusted content Use structured delimiters, server-side system prompts, and explicit content tagging so retrieved documents, email bodies, and tool outputs are never treated as instructions. Review every place where external text enters the context window and make the trust boundary explicit in code and policy.
Minimise agent permissions to the narrowest action set Issue short-lived credentials with only the API scopes and resource relationships the workflow actually needs. Avoid admin tokens for assistants, and define tool-level allowlists so a successful injection cannot expand into unrelated systems.
Gate destructive actions behind deterministic approval Require explicit human confirmation before deleting data, modifying configurations, sending messages, or initiating external transfers. Do not let the model decide when a high-stakes action is safe; enforce that decision in the application layer.
Instrument both input and output monitoring Log prompts, tool calls, and model outputs, then alert on unexpected actions, sensitive data exposure, or anomalous response patterns. Add review workflows that let security teams reconstruct how a prompt moved from text to action.
Red-team LLM workflows on a recurring cadence Test direct injection, indirect injection, and multimodal payloads against the actual application stack, not just the base model. Use findings to tune allowlists, approval gates, and detection rules in the surrounding control plane.

Key takeaways

Prompt injection works because LLMs lack a hard boundary between instructions and untrusted data, so the vulnerability is architectural rather than cosmetic.
Once an LLM can call tools, the damage from prompt injection is determined by the permissions attached to the workflow, not by prompt quality alone.
Effective defence requires layered controls, with least privilege, approval gates, and runtime authorisation doing more work than prompt filtering by itself.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Agentic AI Top 10 and OWASP Non-Human Identity Top 10 address the attack and risk surface, while NIST AI RMF set the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
OWASP Agentic AI Top 10		Prompt injection and tool misuse are core agentic AI risks in the article.
OWASP Non-Human Identity Top 10	NHI-03	Short-lived, scoped credentials are central to limiting agent blast radius.
NIST AI RMF		The article's governance and monitoring concerns align with AI risk management.

Define accountability, monitoring, and escalation for LLM-driven workflows before production use.

Key terms

Prompt Injection: Prompt injection is an attack in which untrusted text alters how a language model behaves, causing it to ignore intended instructions or take unintended actions. The attack exploits the model's inability to enforce a hard boundary between data and control text.
Indirect Prompt Injection: Indirect prompt injection places malicious instructions inside content the model later reads, such as emails, documents, or retrieved web pages. The model can then treat hidden payloads as legitimate instructions, especially in retrieval-augmented systems where external text is mixed with trusted context.
Context Window Pollution: Context window pollution is the gradual accumulation of irrelevant, conflicting, or adversarial text inside an LLM's working context. As the context grows, trusted instructions compete with more untrusted material, which can dilute policy signals and increase the chance of instruction override.
Tool Misuse: Tool misuse occurs when a model or agent uses connected APIs, scripts, or system functions in ways the user did not intend. In practice, it is the point where a prompt attack becomes an identity and authorisation problem because the model can act through delegated access.

Deepen your knowledge

Prompt injection, delegated tool use, and least privilege for AI agents are core topics in the NHI Foundation Level course, the industry's only accredited NHI security programme. If you are building LLM workflows with shared credentials or external retrieval, this is a relevant place to start.

This post draws on content published by WorkOS: Prompt injection attacks and how to defend against them. Read the original.

NHIMG Editorial Note
Published by the NHIMG editorial team on 2026-03-17.
NHI Mgmt Group — the independent authority on Non-Human Identity, IAM, and Agentic AI security. nhimg.org