Subscribe to the Non-Human & AI Identity Journal
Home Glossary Threats, Abuse & Incident Response Multimodal Prompt Injection
Threats, Abuse & Incident Response

Multimodal Prompt Injection

← Back to Glossary
By NHI Mgmt Group Updated June 11, 2026 Domain: Threats, Abuse & Incident Response

A prompt injection attack that arrives through non-text inputs such as images, audio, or video. The malicious instruction is hidden in content the system treats as ordinary user input, then surfaced by preprocessing or model interpretation and acted on by downstream tools or workflows.

Expanded Definition

Multimodal prompt injection is a form of instruction smuggling in which malicious guidance is embedded in non-text media such as images, audio, video, or scanned documents. The system may not treat the payload as a command until preprocessing, transcription, OCR, or model interpretation converts it into machine-usable text. That makes the attack especially relevant where an AI agent, workflow, or retrieval layer accepts mixed inputs and then acts on the extracted meaning.

Definitions vary across vendors because some teams reserve the term for attacks that directly alter model behaviour, while others include any hidden instruction that survives media-to-text conversion and reaches downstream tools. In NHI and agentic AI governance, the important distinction is not the format alone but the trust boundary: content that looks like user data is being treated as operational input. Guidance in the OWASP Agentic AI Top 10 and the NHIMG OWASP Agentic Applications Top 10 both point to the same operational concern: untrusted content can become instructions after interpretation. The most common misapplication is assuming “non-text” equals “non-executable,” which occurs when OCR, speech-to-text, or captioning outputs are not sanitised before agent execution.

Examples and Use Cases

Implementing multimodal prompt injection defences rigorously often introduces more validation steps and slower content handling, requiring organisations to weigh safer automation against lower throughput and higher review cost.

  • An uploaded image contains tiny embedded text that OCR later surfaces as a hidden command, causing an agent to alter a ticket or retrieve restricted data.
  • A voicemail or audio note includes spoken instructions that a transcription service converts into a tool-use directive for a customer support agent.
  • A PDF screenshot or scanned form contains adversarial text that is treated as part of the user request after document parsing.
  • A video frame includes a prompt disguised as on-screen text, which a vision model passes into a downstream workflow without human review.
  • A malicious attachment triggers a summarisation agent to follow embedded instructions rather than summarise the content objectively.

These patterns are especially dangerous in systems that combine media ingestion with autonomous action. The same governance logic that applies to secret handling in NHIMG research on exposed credentials also applies here: once untrusted input can influence action, the control boundary has already failed. For operational context, the broad attack surface described in NHIMG’s Ultimate Guide to Non-Human Identities helps explain why blended input pipelines deserve scrutiny, even when the payload is not a traditional credential attack.

Why It Matters in NHI Security

Multimodal prompt injection matters because many NHI and agent systems are built to consume content, classify it, and then act. If the model, parser, or orchestration layer cannot separate data from instruction, an attacker can redirect a workflow into leaking secrets, issuing unintended API calls, or changing records. That risk compounds in environments with weak secret hygiene and over-privileged automation. NHIMG research shows that 80% of identity breaches involved compromised non-human identities such as service accounts and API keys, and 97% of NHIs carry excessive privileges, which means one successful injection can quickly become a high-impact incident. The governance lesson is straightforward: media handling, parsing, and agent execution should be treated as one chained trust boundary, not separate problems.

Organisations typically encounter the consequence only after an agent has already processed a malicious image, audio clip, or document and executed an unintended action, at which point multimodal prompt injection becomes operationally unavoidable to address.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Agentic AI Top 10 and OWASP Non-Human Identity Top 10 address the attack and risk surface, while NIST AI RMF set the governance and control requirements practitioners need to meet.

FrameworkControl / ReferenceRelevance
OWASP Agentic AI Top 10A1Covers prompt injection risks in agentic systems, including attacks via non-text inputs.
OWASP Non-Human Identity Top 10NHI-04Focuses on agent/tool trust boundaries where hidden instructions can trigger unsafe execution.
NIST AI RMFAddresses AI system risks from adversarial inputs and unsafe downstream behavior.

Treat all parsed media as untrusted input and block agent actions until instructions are validated.

NHIMG Editorial Note
Reviewed and updated by the NHIMG editorial team on June 11, 2026.
NHI Mgmt Group — the #1 independent authority on Non-Human Identity, IAM, and Agentic AI security. nhimg.org