Indirect prompt injection is becoming an operational exploit

By NHI Mgmt Group Editorial TeamPublished 2025-08-12Domain: Agentic AI & NHIsSource: Pillar Security

TL;DR: Indirect prompt injection succeeds when malicious instructions are embedded in trusted data and LLMs can act on them across sensitive workflows, according to Pillar Security’s analysis. The real risk is not the payload alone but the combination of private data access, untrusted inputs, and external communication that turns prompt attacks into operational exploits.

At a glance

What this is: This research explains how indirect prompt injection works, using the CFS model to show why context, format, and salience determine whether a malicious payload is ignored or executed.

Why it matters: It matters because LLMs connected to private data and external tools inherit identity and access risks that traditional control models do not fully account for across NHI, autonomous, and human programmes.

By the numbers:

98% of companies plan to deploy even more AI agents within the next 12 months, despite documented rogue behaviour in 80% of current deployments.
92% agree governing AI agents is critical to enterprise security, yet only 44% have implemented any policies to do so.

👉 Read Pillar Security's analysis of indirect prompt injection and the CFS model

Context

Indirect prompt injection is a control failure, not a novelty. It happens when an LLM treats embedded instructions inside emails, documents, tickets, code comments, or web content as if they were authoritative, even though the attacker controls the data source.

For identity teams, the risk lands where LLMs intersect with sensitive data, tool access, and outbound communication. That makes the problem relevant to NHI governance, agentic AI identity controls, and the wider IAM assumptions behind who or what is allowed to act on trusted information.

Key questions

Q: How should security teams reduce indirect prompt injection risk in LLM workflows?

A: Start by separating untrusted content from system instructions, then limit what the model can do with sensitive data. The most effective controls are workflow-level: restrict tool permissions, harden outbound channels, and treat every email, ticket, or file as potentially adversarial until validated.

Q: Why do private-data access and outbound tools make prompt injection worse?

A: Because prompt injection becomes operational when the model can read something valuable and send it somewhere useful. Private-data access creates the target, untrusted inputs create the vector, and outbound tools create the exfiltration path. Remove any one of those conditions and the attack loses force.

Q: What do teams get wrong about indirect prompt injection?

A: They often focus only on the prompt text and ignore the surrounding workflow. The real failure is usually the combination of trusted input formats, strong model authority, and connected tools that let attacker-controlled content influence decisions inside a privileged process.

Q: Who is accountable when an LLM leaks data after following malicious instructions?

A: Accountability sits with the organisation that granted the model access, connected the tools, and allowed untrusted content into the same decision path. That makes this a governance issue across IAM, security engineering, and application ownership, not a defect that belongs to the model alone.

Technical breakdown

Why indirect prompt injection succeeds in trusted workflows

Indirect prompt injection works because the model is asked to interpret content and instructions in the same context window. A malicious payload can therefore hide inside normal business data, such as support tickets, emails, or code files, and still influence the model’s next action. The attack is not about bypassing authentication in the classic sense. It is about confusing instruction boundaries so the model treats attacker-controlled text as part of the task. This is why systems that combine private data access with external actions are high-risk: they give the payload something valuable to steal and somewhere to send it.

Practical implication: isolate untrusted inputs from instruction channels and review every workflow where a model can both read sensitive content and act on it.

The CFS model: context, format, and salience

Pillar’s CFS model describes three reasons a payload is more likely to work. Context means the malicious instruction matches the task, tools, and expected output of the workflow. Format means it blends into the medium, whether that is HTML, JSON, comments, or plain text. Salience means the instruction is placed and phrased so the model is likely to notice and follow it. These are not separate tricks. They reinforce one another. A payload that looks native to the content and appears highly relevant to the task can outrank legitimate system intent inside the model’s attention window.

Practical implication: threat-model the content format and placement of untrusted text, not just the model endpoint or prompt template.

Why the lethal trifecta raises the stakes for LLM security

Simon Willison’s lethal trifecta is useful because it describes the enabling conditions that turn prompt injection into data loss: access to private data, exposure to untrusted content, and the ability to send information externally. When all three are present, a prompt injection can move from persuasion to exfiltration. That is why indirect prompt injection belongs in the same governance conversation as secrets exposure, tool permissions, and outbound controls. The technical issue is not that models are clever enough to be tricked. It is that the surrounding system often gives them enough authority to be dangerous.

Practical implication: reduce one leg of the trifecta wherever possible, especially outbound channels and private-data reach.

Threat narrative

Attacker objective: The attacker wants the model to convert trusted content into a covert instruction channel that exposes sensitive data or triggers unauthorised actions.

Entry occurs when an attacker embeds hidden instructions inside a trusted object such as an email, support ticket, or code comment that an LLM is likely to process.
Escalation occurs when the model treats that embedded text as an instruction, gains context from private data, and follows the malicious payload instead of the user’s intent.
Impact occurs when the model leaks sensitive data or performs an unauthorised action through an allowed outbound tool or workflow.

Moltbook AI agent keys breach — Moltbook breach exposed 1.5M AI agent keys.
AI LLM hijack breach — attackers used stolen AWS access keys to hijack Anthropic LLM models on Bedrock.

Read our 52 NHI Breaches Analysis report for a comprehensive view of breaches impacting Non-Human Identities including AI Agents.

NHI Mgmt Group analysis

Indirect prompt injection is an instruction-boundary problem before it is an AI problem. The failure begins when systems collapse data and directives into one processing stream, then ask the model to decide what is authoritative. That makes the control gap broader than prompt hygiene. Practitioners need to treat content ingestion, tool invocation, and output generation as separate trust zones.

Context, format, and salience are a practical attack model for LLM abuse. The article’s CFS framework is valuable because it explains why some payloads fail and others survive. It shifts analysis from generic prompt attacks to the conditions that make malicious text feel native to the workflow. That is the level at which defenders should test email assistants, ticketing copilots, and coding agents.

Private-data access plus outbound action is where prompt injection becomes identity risk. Once an LLM can read sensitive content and transmit results externally, the attack is no longer just content manipulation. It becomes an access problem, because the model is operating with borrowed authority. Identity teams should see that as a control-plane issue, not a model-quality issue.

Operational exploitability depends on whether untrusted inputs can inherit task authority. The article shows that the same payload can be harmless in one environment and dangerous in another depending on what the workflow lets the model do. That means governance has to measure privilege at the workflow level, not only at the model or user level.

OWASP NHI Top 10 is the right lens when LLMs are acting through non-human access paths. Prompt injection becomes more damaging when the model is tied to service credentials, APIs, or tool integrations that already have standing access. The governing question is not whether the text is malicious, but whether the identity chain gives it a route to matter.

From our research:
98% of companies plan to deploy even more AI agents within the next 12 months, despite documented rogue behaviour in 80% of current deployments, according to AI Agents: The New Attack Surface report.
Only 52% of companies can track and audit the data their AI agents access, leaving 48% with a complete blind spot for compliance and breach investigation.
For a broader control model, see OWASP Agentic AI Top 10 for the main agentic application risk categories.

What this signals

Instruction-boundary governance is becoming a practical requirement for any team connecting LLMs to email, tickets, code, or documents. The next control gap is not whether the model can reason, but whether it can distinguish content from command when content arrives in a trusted workflow.

With 80% of current AI deployments already showing rogue behaviour in the SailPoint data, the operational question shifts to containment. Teams need to decide which workflows can tolerate model autonomy, which must stay human-mediated, and which should never combine private data with outbound action.

For teams building agent controls, the most useful reference point is OWASP NHI Top 10: not because every prompt injection is an agentic attack, but because the same identity and tool-use assumptions keep failing in adjacent ways.

For practitioners

Separate instruction channels from data channels Keep system instructions, user prompts, and untrusted content in distinct processing paths. Do not let emails, tickets, documents, or code comments become implicit instruction sources for the same model session.
Restrict outbound capability on high-risk LLM workflows Remove or tightly mediate external communication paths where models process private data. If the system cannot exfiltrate, prompt injection loses one of its most dangerous outcomes.
Test for CFS exposure in real workflows Red-team the exact content types your teams use most, including HTML, JSON, code comments, and ticket text. Score each workflow for context fit, format fit, and salience, then prioritise the ones where all three are high.
Treat model tool access as borrowed identity authority Review every connected tool, API, and secret as part of the model’s effective access path. Where the model can query data or send output, apply least privilege and approval gates to the underlying identity, not just the app shell.

Key takeaways

Indirect prompt injection works because LLMs often cannot reliably separate trusted instructions from attacker-controlled content.
The combination of private-data access, untrusted inputs, and outbound communication turns a prompt attack into a real exfiltration path.
Defenders should focus on workflow boundaries, tool permissions, and content isolation rather than prompt wording alone.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Agentic AI Top 10 and OWASP Non-Human Identity Top 10 address the attack and risk surface, while NIST CSF 2.0 set the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
OWASP Agentic AI Top 10		Indirect prompt injection targets agent instruction handling and tool use.
OWASP Non-Human Identity Top 10	NHI-03	Connected tools and secrets create non-human identity abuse paths.
NIST CSF 2.0	PR.AC-4	Prompt injection becomes impactful when access permissions exceed workflow need.

Classify untrusted content paths and restrict tool execution when model instructions can be attacker-controlled.

Key terms

Indirect Prompt Injection: An attack where malicious instructions are hidden inside data that an AI model is expected to process, such as an email, file, or ticket. The model treats the attacker’s content as if it were part of the task, which can lead to data leakage, unauthorised actions, or tool misuse.
Context, Format, Salience: A model for explaining why some indirect prompt injections succeed. Context is task fit, format is how well the payload blends into the medium, and salience is how strongly the instruction stands out to the model. Together, they describe whether malicious text feels legitimate enough to follow.
Lethal Trifecta: A risk pattern where an AI system can access private data, process untrusted content, and communicate externally. When all three are present, prompt injection can move from a confusing instruction problem to a real data exfiltration and governance problem.
Instruction Boundary: The line between content the model should interpret and directives it should obey. In practice, this boundary is often blurred when workflows mix user data, embedded instructions, and tool permissions in one session. Strong governance depends on keeping that boundary explicit and enforceable.

Deepen your knowledge

NHI governance, agentic AI identity, and machine identity security are core topics in our NHI Foundation Level course, the industry's only accredited NHI security programme. If you are building or maturing an identity or security programme, it is worth exploring.

This post draws on content published by Pillar Security: Anatomy of an Indirect Prompt Injection. Read the original.

NHIMG Editorial Note
Published by the NHIMG editorial team on 2025-08-12.
NHI Mgmt Group — the independent authority on Non-Human Identity, IAM, and Agentic AI security. nhimg.org