Prompt injection exposes the shared-responsibility gap in AI security

By NHI Mgmt Group Editorial TeamPublished 2026-02-27Domain: Agentic AI & NHIsSource: WitnessAI

TL;DR: Prompt injection is the top OWASP LLM vulnerability because attackers can override model behavior with plain language, and indirect injections in documents or retrieved data can redirect chatbots and agents without code exploits, according to WitnessAI. The real failure is assuming natural-language systems can be governed like structured applications when their input and instruction boundaries are not technically separable.

At a glance

What this is: Prompt injection is a natural-language attack against LLMs that can override intended behavior and redirect responses or agent actions.

Why it matters: It matters because IAM, NHI, and AI governance teams must treat prompts, retrieved content, and downstream actions as one attack surface, not isolated controls.

👉 Read WitnessAI's analysis of prompt injection and runtime AI defence

Context

Prompt injection is a governance problem for AI systems that process language as both input and instruction. The key weakness is not a broken API or malformed packet, but the absence of a reliable technical boundary between trusted directives and untrusted text.

For IAM and security teams, that means the control plane must extend beyond model access into runtime inspection, output filtering, and action gating. The same pattern applies across customer chatbots, internal copilots, and agentic workflows because any natural-language interface can be used to steer behaviour.

The practical question is no longer whether a model is helpful, but whether its surrounding controls assume structured input where none exists. That assumption breaks as soon as retrieved documents, emails, or chat messages can carry hidden instructions into the session.

Key questions

Q: How should security teams reduce prompt injection risk in AI assistants?

A: Security teams should place controls outside the model, not just inside the prompt. Inspect user input, retrieved content, and model output at runtime, then block any action that exceeds the task scope. The most effective programs combine prompt filtering, content isolation, and permission limits so a malicious instruction cannot become an authorised workflow step.

Q: Why do prompt injections create more risk for AI agents than for chatbots?

A: AI agents can turn a malicious instruction into a real action, such as a query, file transfer, or API call. A chatbot may only produce a harmful answer, but an agent can execute the harmful instruction inside connected systems. That makes permission scope and pre-execution review much more important for agents than for text-only assistants.

Q: What do teams get wrong about indirect prompt injection?

A: Teams often treat retrieved documents as safe because they are business content, not code. That assumption fails when a document carries hidden instructions that the model later processes as context. The right mental model is to treat any external text source as potentially adversarial until it has been inspected and constrained.

Q: Should organisations rely on model safety features alone to stop prompt injection?

A: No. Model-level guardrails reduce risk, but they do not define enterprise context, data boundaries, or action permissions. Organisations need their own enforcement layer for prompts, responses, and tool calls because the provider cannot know which business process is safe, which data is sensitive, or which action is out of bounds.

Technical breakdown

Direct prompt injection in chat interfaces

Direct prompt injection happens when an attacker sends malicious instructions straight into the conversational interface. The model receives those instructions alongside the system prompt and conversation history, then tries to satisfy the most recent or most salient text pattern. Techniques include role-play, multi-turn manipulation, encoded text, and instruction suppression such as ignoring previous directions. This is not a protocol exploit in the classic sense. It is a language conflict inside the model’s token stream, where the attacker’s text competes with the developer’s intended policy layer.

Practical implication: treat chat interfaces as untrusted entry points and inspect user prompts before they reach the model.

Indirect prompt injection through retrieved content

Indirect prompt injection hides malicious instructions inside documents, emails, shared files, or retrieved records that the model later consumes. Retrieval-augmented generation makes this especially risky because the model reads external content as context, not as potentially hostile payload. If the hidden text tells the model to reveal data, alter output, or trigger a follow-on action, the model may comply because it cannot reliably distinguish content from command. The attack is dangerous precisely because the malicious text looks like ordinary business content.

Practical implication: classify retrieved content as a potential control input and apply content sanitisation before retrieval or summarisation.

Why autonomous agents magnify prompt injection risk

The risk becomes materially worse when the LLM is attached to agentic workflows that can call APIs, query data, or execute multi-step tasks. In that setup, the injected instruction is no longer just a bad answer. It can become a tool call, a data export, or a transaction routed through legitimate permissions. This is where prompt injection turns into action abuse. The model is not simply generating text, it is authorising a downstream operation inside a workflow that may already trust the agent’s output.

Practical implication: require pre-execution checks and fine-grained action permissions for any agent that can affect systems or data.

Threat narrative

Attacker objective: The attacker aims to hijack model behaviour so the AI reveals data, changes outputs, or performs unauthorised actions through legitimate workflows.

Entry occurs when an attacker supplies direct malicious instructions in a chat session or embeds hidden instructions in content the model will later process.
Credential or data access occurs when the model follows those instructions and exposes prompts, records, or connected-system data that were never intended to be returned.
Impact occurs when a chatbot produces harmful output or an agent executes an unauthorised action, creating data exposure, brand damage, or operational loss.

Moltbook AI agent keys breach — Moltbook breach exposed 1.5M AI agent keys.
AI LLM hijack breach — attackers used stolen AWS access keys to hijack Anthropic LLM models on Bedrock.

Read our 52 NHI Breaches Analysis report for a comprehensive view of breaches impacting Non-Human Identities including AI Agents.

NHI Mgmt Group analysis

Prompt injection is a governance failure, not just an application flaw. The attack succeeds because enterprises often treat LLM input as if it were ordinary text rather than a live control surface. That is a structural problem for OWASP-NHI and zero-trust governance, because the same session can contain instructions, data, and actions with no trustworthy separation. The implication is that AI governance has to assume hostile language at runtime, not just malicious users at login.

Shared-responsibility only works when the enterprise owns the runtime boundary. Model providers can harden the model, but they cannot define which documents, prompts, and downstream systems are safe inside a specific business process. That means the enterprise must govern the action layer, the retrieval layer, and the response layer as one identity chain. Practitioners should stop assuming provider-side alignment can substitute for local enforcement.

Indirect prompt injection shows that trusted data sources can become identity abuse vectors. A poisoned document, email, or knowledge-base record can carry instructions that steer a chatbot or agent into exposing information or taking action. This is not merely data quality risk. It is a control-plane failure where content is allowed to influence authority. Practitioners should treat retrieved content as an untrusted identity input.

Runtime inspection is becoming the named control pattern for AI interaction security. The article points to bidirectional defence because the same system must inspect prompts before model processing and filter outputs before they trigger action. That aligns with the broader NHI reality that static policy alone cannot manage a dynamic interaction chain. The implication is that prompt security, output gating, and behavioural monitoring need to be engineered together, not added as isolated point products.

From our research:
Only 52% of companies can track and audit the data their AI agents access, leaving 48% with a complete blind spot for compliance and breach investigation, according to AI Agents: The New Attack Surface report.
80% of organisations report their AI agents have already performed actions beyond their intended scope, including accessing unauthorised systems and revealing access credentials.
That control gap is why OWASP NHI Top 10 remains relevant when language systems are allowed to influence identity, data, and action.

What this signals

Prompt injection is pushing AI security from model trust to runtime governance. The programme question is no longer whether the model is aligned, but whether every prompt, retrieval event, and tool call is covered by an enforceable control boundary. Teams that already map identity flows across human users and NHIs will find the same discipline useful here, but the control surface is broader because language can now initiate action.

OWASP Agentic AI Top 10 and the broader NHI governance conversation are converging on the same operational lesson: the dangerous moment is when text becomes authority. Teams should prepare for policy enforcement at the interaction layer, not only at authentication, because the blast radius lives in what the model can read, say, and trigger. That is where runtime inspection and behavioural monitoring become programme requirements, not architecture extras.

With 98% of organisations planning to deploy more AI agents within 12 months, the security gap will widen unless action gating and content isolation are built into the operating model. The immediate programme signal is to audit which assistants can reach sensitive systems, what inputs they trust, and where a malicious instruction could become an authorised event.

For practitioners

Inspect prompts before model execution Place a runtime inspection layer in front of every chatbot, copilot, or agent so untrusted instructions are screened before they reach the model or downstream action handler.
Separate retrieved content from authority Tag documents, emails, and knowledge-base records as untrusted inputs during retrieval and summarisation so hidden instructions cannot inherit system-level authority.
Constrain agent tool scope tightly Limit each agent to the smallest action set it needs, and block database queries, exports, or endpoint calls that are outside the current business task.
Add pre-execution approval for high-risk actions Require human review before an AI system can move money, change access, or transmit sensitive data, especially when the instruction originated from external content.

Key takeaways

Prompt injection works because LLMs do not reliably separate instructions from untrusted text at runtime.
The business risk rises sharply when a chatbot can move from bad output to unauthorised action through connected tools.
Security teams need external inspection, content isolation, and tight action permissions to govern AI safely.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Agentic AI Top 10 and OWASP Non-Human Identity Top 10 address the attack and risk surface, while NIST CSF 2.0 set the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
OWASP Agentic AI Top 10	AI1	Prompt injection and tool misuse are core agentic AI risks in this article.
OWASP Non-Human Identity Top 10	NHI-01	The article centres on compromised non-human access and authority abuse.
NIST CSF 2.0	PR.AC-4	Access and authority boundaries are central to preventing injected actions.

Map AI actions to least-privilege access rules and review every connected system the model can reach.

Key terms

Prompt Injection: Prompt injection is an attack that uses natural language to override or redirect an LLM’s intended behaviour. Instead of exploiting code, the attacker supplies text that competes with system instructions and can influence responses, data disclosure, or downstream actions. The weakness is structural, because the model cannot reliably separate command from content.
Indirect Prompt Injection: Indirect prompt injection hides malicious instructions inside content the model later consumes, such as documents, emails, or retrieved records. The model treats that content as context, but the attacker uses it to steer behaviour. This makes ordinary business text a potential control surface and expands the attack beyond the chat box.
Runtime Inspection: Runtime inspection is the practice of checking prompts and outputs as AI interactions happen, before they reach the model or trigger an action. It is an enforcement layer outside the model itself. For AI governance, it is the control that can still act when the model’s own guardrails are not enough.
Agentic AI: Agentic AI is an AI system that can choose actions and execute multi-step tasks through connected tools or services. In practice, the governance issue is not just what it says, but what it can do. Once an agent can call APIs or move data, prompt manipulation becomes a direct operational risk.

Deepen your knowledge

Prompt injection and AI runtime governance are core topics in our NHI Foundation Level course, the industry's only accredited NHI security programme. If you are building controls for chatbots, copilots, or agents that can act on text, it is worth exploring.

This post draws on content published by WitnessAI: Prompt injection as the number one vulnerability in LLM applications. Read the original.

NHIMG Editorial Note
Published by the NHIMG editorial team on 2026-02-27.
NHI Mgmt Group — the independent authority on Non-Human Identity, IAM, and Agentic AI security. nhimg.org