Subscribe to the Non-Human & AI Identity Journal

Prompt hijacking

Prompt hijacking is the use of malicious instructions to steer an AI system toward unintended behaviour over time. For agentic systems, the effect can persist beyond one interaction because memory, tool access, and chained delegation can preserve the attack influence across sessions.

Expanded Definition

Prompt hijacking is a form of instruction injection where attacker-controlled text overrides or reshapes the intended task of an AI system. In agentic environments, it matters because the injected instruction may be carried into memory, tool calls, delegation chains, or retrieval-augmented context, allowing the attacker to persist influence beyond a single turn. Definitions vary across vendors, but the practical concern is consistent: the system follows malicious instructions as if they were legitimate operator intent.

For NHI and agent governance, prompt hijacking is best understood as an integrity failure in the control plane of an AI agent, not just a content moderation issue. Frameworks such as NIST Cybersecurity Framework 2.0 emphasise protecting the integrity of systems, data, and access paths, which maps directly to prompt handling, memory boundaries, and tool authorization. The risk increases when an agent can act on secrets, invoke MCP-connected tools, or inherit privileges from upstream workflows.

The most common misapplication is treating prompt hijacking as simple user misconduct, which occurs when organisations ignore how injected instructions persist through memory, retrieval, or delegated actions.

Examples and Use Cases

Implementing prompt hijacking defenses rigorously often introduces friction in the user experience, requiring organisations to weigh stricter input controls against the flexibility that makes agentic systems useful.

  • A customer-support agent is tricked into prioritising attacker-written instructions embedded in a ticket, causing it to disclose workflow details or bypass normal escalation logic.
  • A coding agent ingests a poisoned repository comment, then follows the malicious instruction during later tool use, including access to secrets or CI/CD credentials.
  • A retrieval-augmented assistant pulls hostile text from a document corpus and treats it as higher-priority instruction than the system prompt or policy guardrails.
  • An autonomous workflow agent receives a seemingly harmless prompt that instructs it to maintain state, then later uses that memory to continue the attacker’s objective across sessions.
  • An operator reviews NHI governance and discovers that a service account with broad permissions enabled the attacker’s injected instruction to reach external APIs, a pattern discussed in the Ultimate Guide to NHIs.

These scenarios show why prompt boundaries must be treated like trust boundaries. The same mitigation logic that protects tokens, roles, and session scope in Ultimate Guide to NHIs also applies to agent instructions, especially where tooling or delegated execution is involved.

Why It Matters in NHI Security

Prompt hijacking becomes an NHI security problem when an AI agent can act with the authority of an identity, not just generate text. If the agent can read secrets, call APIs, rotate credentials, or approve actions, a successful hijack can turn one malicious instruction into a broader compromise. This is why prompt safety, entitlement design, and secrets hygiene must be considered together, rather than as separate teams or separate tools. The NHI guidance in Ultimate Guide to NHIs shows how excessive privilege, weak visibility, and poor rotation amplify downstream impact when an identity is abused.

One relevant signal is that 97% of NHIs carry excessive privileges, increasing unauthorised access and broadening the attack surface. When an agent is prompt-hijacked, excessive privilege turns a single instruction breach into lateral movement, data exposure, or unauthorised automation. That is why prompt hygiene, strict tool scoping, and Zero Trust controls should be aligned with NIST Cybersecurity Framework 2.0, especially around access control and protective technology.

Organisations typically encounter prompt hijacking only after an agent has already misused a tool, leaked a secret, or executed an unintended action, at which point the term becomes operationally unavoidable to address.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Agentic AI Top 10 and OWASP Non-Human Identity Top 10 address the attack and risk surface, while NIST Zero Trust (SP 800-207) set the governance and control requirements practitioners need to meet.

Framework Control / Reference Relevance
OWASP Agentic AI Top 10 Covers prompt injection and agent instruction misuse as core AI agent risks.
OWASP Non-Human Identity Top 10 NHI-02 Prompt hijacking becomes severe when secrets and tool access are exposed to malicious instructions.
NIST Zero Trust (SP 800-207) Zero Trust applies by verifying every request, including agent-originated instructions and tool actions.

Reduce secret exposure, scope permissions, and verify that agents cannot inherit unsafe authority.