Indirect prompt injection exposes a new AI identity attack surface

By NHI Mgmt Group Editorial TeamPublished 2026-03-06Domain: Agentic AI & NHIsSource: WitnessAI

TL;DR: Indirect prompt injection lets attackers hide malicious instructions inside emails, documents, web pages, and knowledge bases that AI systems already trust, and agentic AI can turn that into unauthorized actions with system credentials, according to WitnessAI. Existing guardrails, pattern matching, and application controls provide only partial coverage because the model cannot reliably separate trusted instructions from untrusted content.

At a glance

What this is: Indirect prompt injection hides malicious instructions inside trusted content sources that AI systems consume, turning normal retrieval into an attack path for altered outputs, data exfiltration, and unauthorised actions.

Why it matters: IAM teams need to treat AI inputs, outputs, and tool chains as identity-bearing control points because indirect injection can convert ordinary content governance failures into privilege abuse across NHI, autonomous, and human programmes.

By the numbers:

U.S. breach costs reached $10.22 million while Shadow AI added $670,000 to average breach costs for organisations with high levels of unsanctioned AI usage.
97% of organisations that reported an AI-related breach lacked proper AI access controls.
Only 34% performed regular audits for unsanctioned AI.

👉 Read WitnessAI's analysis of indirect prompt injection and AI identity risk

Context

Indirect prompt injection is a control failure in which untrusted content is treated as instruction material because the model processes both through the same natural-language channel. For identity teams, that means the security boundary is no longer just the application session. It now includes the data sources, retrieval paths, and downstream actions that AI systems can trigger.

The governance problem becomes sharper when AI systems operate with delegated credentials, tool access, or persistent memory. In that setting, a poisoned document or page is not only a content issue. It can become an identity and authorisation issue that crosses from information handling into execution control.

Key questions

Q: What breaks when indirect prompt injection is not controlled in AI systems?

A: Indirect prompt injection breaks the assumption that retrieved content is safe to use as instruction material. Once malicious text enters the model context, the system may alter responses, leak data, or trigger tools with delegated permissions. The core failure is boundary collapse between data and directive, which turns ordinary content ingestion into an execution risk.

Q: Why do AI agents make indirect prompt injection more dangerous for enterprises?

A: AI agents make indirect prompt injection more dangerous because the model can take actions, not just generate text. If the injected instruction reaches an agent with tool access, the attacker can steer APIs, workflows, and downstream systems through the agent's own permissions. That changes the issue from misinformation to privilege abuse.

Q: How do security teams know whether intent-based classification is working for AI content?

A: Teams should test whether the control catches semantically disguised requests, multilingual payloads, hidden text, and transformed instructions that do not match known signatures. If the system only blocks obvious phrases, it is not detecting intent. Effective programmes measure whether unsafe content is intercepted before retrieval, response generation, or tool execution.

Q: Who is accountable when an AI system acts on injected content?

A: Accountability sits with the organisation that allowed untrusted content, retrieval paths, and privileged execution to intersect without adequate controls. Regulators and auditors will look for audit trails, approval gates, access scope, and evidence that high-risk actions required separate authorisation. Without that, the system owner cannot credibly argue that the action was isolated or unintended.

Technical breakdown

How indirect prompt injection enters retrieval pipelines

Indirect prompt injection works when malicious instructions are embedded in content that a model later ingests through retrieval-augmented generation, browsing, or knowledge lookup. Because the model sees the malicious text in the same format as legitimate instructions, there is no native separation between data and directive. Attackers exploit that ambiguity with delimiters, role hijacking, language switching, and semantic rephrasing, then hide the payload in sources that look routine to users and defenders alike.

Practical implication: inspect retrieved content before it is merged into the model context, not only the user prompt.

Why agentic AI expands the blast radius

Agentic AI changes the impact profile because the model is no longer just producing text. It can call tools, make API requests, chain actions, and persist across sessions through memory. If an injected instruction survives into that execution path, the attacker can steer authorised systems with delegated identity rather than stealing a session outright. Model Context Protocol servers widen the surface because each connected tool path becomes part of the decision chain.

Practical implication: treat tool access, memory, and MCP connections as part of the identity perimeter.

Why pattern matching only catches part of the attack

Keyword filters and signature-based guards fail because indirect prompt injection is often expressed through meaning, not literal phrases. A malicious instruction can be encoded in another language, transformed with simple ciphers, hidden as invisible text, or phrased as a benign request that implies the same action. That makes the defence problem closer to intent recognition than to classic string matching, which is why broad content inspection and policy enforcement matter more than any single prompt rule.

Practical implication: use intent-based classification and bidirectional scanning, not standalone prompt filters.

Threat narrative

Attacker objective: The attacker wants to turn trusted content into a covert command channel that makes the AI system act on the attacker’s behalf without visible prompt abuse.

Entry occurs when the attacker plants malicious instructions inside an email, document, web page, or knowledge base that the AI system will later retrieve during normal operation.
Escalation happens when the model accepts the hidden instructions as legitimate context and applies them with delegated permissions, tool access, or memory persistence.
Impact follows when the injected directive triggers data exfiltration, tool manipulation, policy override, or other unauthorised action using the system's own credentials.

Schneider Electric credentials breach — exposed credentials gave attackers access to Schneider Electric Jira, exfiltrating 40GB.
Cisco DevHub NHI breach — IntelBroker exploited exposed Cisco credentials, API tokens and keys in DevHub.

Read our 52 NHI Breaches Analysis report for a comprehensive view of breaches impacting Non-Human Identities including AI Agents.

NHI Mgmt Group analysis

Indirect prompt injection is an identity problem before it is a prompt problem. The attack succeeds because organisations allow untrusted content to enter the same decision space as trusted instructions. Once that boundary collapses, content governance and authorisation governance become the same control plane from the model’s point of view. Practitioners should treat retrieval, memory, and tool inputs as part of the identity surface, not as neutral text channels.

Agentic AI turns content poisoning into delegated credential abuse. The article is right to show that the real risk is not bad output alone, but unauthorised execution with system permissions. That moves the discussion from hallucination management to privilege containment, because the model can now operationalise the injected instruction through tools and APIs. The implication is that AI security must be evaluated as execution governance, not just answer quality.

Intent-based classification is the named control shift this category needs. Pattern matching was designed for fixed signatures and obvious malicious strings. Indirect prompt injection is often semantically disguised, multilingual, or embedded inside legitimate content structures, so the control premise fails at the detection layer. The implication is that security programmes need to stop assuming text can be safely judged by surface form alone.

Continuous inspection matters more than one-time trust decisions. The attack chain can begin in a source document, pass through retrieval, mutate in the response path, and then land in an autonomous action. That means the security model must follow the content all the way through the session, not stop at the first allow decision. Practitioners should design for inspection at every transition point where data can become instruction.

Shadow AI makes the governance gap wider than formal deployments suggest. The article’s discussion of hidden tools and agent sprawl reflects a structural reality: the exposure is not limited to sanctioned AI projects. Once teams lose visibility into which agents, connectors, and MCP servers exist, they also lose the ability to prove who or what had the authority to act. Practitioners should govern discovery before they attempt policy enforcement.

From our research:
97% of organisations that reported an AI-related breach lacked proper AI access controls, according to LLMjacking: How Attackers Hijack AI Using Compromised NHIs.
Shadow AI added $670,000 to average breach costs for organisations with high levels of unsanctioned AI usage, according to LLMjacking: How Attackers Hijack AI Using Compromised NHIs.
For a broader NHI control lens, see The 52 NHI breaches Report for recurring credential and access patterns that still drive modern identity incidents.

What this signals

Intent-based controls are becoming the practical dividing line for AI governance. Pattern filters can reduce noise, but they do not solve semantic abuse in retrieval pipelines or prompt context. Programmes that want durable control will need to classify meaning, inspect both directions of traffic, and treat AI inputs as governed identity events, not just text handling.

As AI systems absorb more data sources and tools, the governance perimeter shifts from application access to action authorization. That means identity teams should expect more overlap between content security, workload identity, and privileged execution. The organisations that formalise that overlap early will have a clearer path to auditability and safer delegation.

The scale of Shadow AI means visibility is now a prerequisite for policy enforcement. If teams cannot discover which agents, connectors, and MCP servers are active, they cannot prove which identity performed which action. That makes discovery, ownership, and evidence retention the first operational controls to stabilise.

For practitioners

Scan retrieval paths before model execution Inspect emails, documents, web pages, and knowledge base entries before they are merged into the model context. Flag delimiters, role hijacking language, hidden instructions, and transformed text before the model can interpret them.
Apply bidirectional inspection to model flows Validate both inputs and outputs so a poisoned response cannot become the next prompt, the next tool call, or an exfiltration channel. Extend inspection to user input, tool input, tool output, and final agent answers.
Treat tool access as privileged execution Restrict least-privilege tool use, scope-limited credentials, and human approval for high-risk actions such as financial changes, protected data access, and system modifications. Do not allow retrieval content to trigger irreversible actions without a separate control gate.
Map and monitor shadow AI entry points Discover MCP servers, agent connectors, and unmanaged AI workflows before policy rollout. Build an inventory that ties each agent action back to a human owner and a specific approval path.

Key takeaways

Indirect prompt injection succeeds by turning trusted content sources into hidden instruction channels, which collapses the boundary between data handling and execution control.
The impact grows sharply when AI systems have tool access, delegated credentials, or persistent memory, because the attack can move from bad output to unauthorised action.
Security teams need bidirectional inspection, intent-aware classification, and privileged action gates if they want AI governance to survive real-world content poisoning.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Agentic AI Top 10 and OWASP Non-Human Identity Top 10 address the attack and risk surface, while NIST CSF 2.0 set the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
OWASP Agentic AI Top 10	AGENT-03	Prompt injection and tool misuse are central to this article.
OWASP Non-Human Identity Top 10	NHI-03	Delegated credentials and access scope are part of the attack impact.
NIST CSF 2.0	PR.AC-4	Least privilege and access governance are directly implicated.

Classify and test AI inputs, outputs, and tool calls for injection paths before execution.

Key terms

Indirect Prompt Injection: An attack in which malicious instructions are hidden inside content an AI system consumes during normal operation. The model treats that content as if it were legitimate context, which can redirect outputs, leak data, or trigger actions if downstream controls do not separate data from instruction.
Delegated Identity: An operating model where an AI system acts using permissions assigned by a human, service, or platform owner. In practice, that identity can make the model's actions appear authorised even when the content that shaped the decision was attacker-controlled, which makes approval boundaries and audit trails essential.
Bidirectional Inspection: A control approach that checks both AI inputs and AI outputs before they can influence model reasoning or external actions. It is more than prompt filtering because it inspects retrieved content, responses, and tool traffic to reduce the chance that poisoned text becomes an instruction or an exfiltration path.
Intent-Based Classification: A detection method that looks at what content is trying to do, not just whether it matches known attack strings. For AI governance, it is used to identify disguised instructions, semantic abuse, and harmful action requests that evade keyword rules or signature-based controls.

Deepen your knowledge

Indirect prompt injection and agent privilege governance are covered in our NHI Foundation Level course, the industry's only accredited NHI security programme. If your team is trying to secure AI content pipelines and delegated execution paths, this course is a practical starting point.

This post draws on content published by WitnessAI: Indirect prompt injection and the security implications for AI systems. Read the original.

NHIMG Editorial Note
Published by the NHIMG editorial team on 2026-03-06.
NHI Mgmt Group — the independent authority on Non-Human Identity, IAM, and Agentic AI security. nhimg.org