What Is Multimodal AI? Definition & Examples

Expanded Definition

Multimodal AI refers to AI systems that can interpret and combine signals from more than one modality, such as text, images, audio, video, or structured data, within a single inference or agentic workflow. In security terms, the key issue is not novelty of input types, but the fact that one modality can carry instruction-like content that changes behaviour even when the text layer appears benign. Guidance across vendors is still evolving on where multimodal interpretation ends and agentic action begins, so NHI teams should treat the boundary as operational rather than purely architectural. The most useful standards lens is NIST Cybersecurity Framework 2.0, because it ties this capability back to governance, risk management, and control validation instead of model novelty alone. NHI Management Group treats multimodal AI as a trust-boundary expansion problem, where every supported channel becomes a potential path for prompt injection, data leakage, or unsafe tool invocation. The most common misapplication is assuming only visible text prompts matter, which occurs when organisations ignore instructions embedded in images, transcripts, or other non-text inputs.

Examples and Use Cases

Implementing multimodal AI rigorously often introduces higher review and sanitisation overhead, requiring organisations to weigh richer automation against a broader attack surface.

An AI agent reviews customer support screenshots, and hidden text inside the image attempts to redirect the agent into disclosing account data.

A voice-enabled assistant transcribes a call and treats adversarial spoken content as operational instruction rather than user conversation.

A document-analysis workflow extracts text from PDFs, where embedded captions, annotations, or OCR artifacts carry malicious instructions that alter tool use.

A compliance agent correlates charts and narrative text, but a manipulated image steers it toward an incorrect risk conclusion before escalation.

Security teams investigating DeepSeek breach findings should consider how multimodal ingestion can amplify exposure when secrets or sensitive context are present across multiple inputs.

These use cases align with broader identity and access risk thinking described in LLMjacking: How Attackers Hijack AI Using Compromised NHIs, where attacker behaviour focuses on abusing the paths that AI systems trust. For control design, multimodal pipelines should be evaluated using the same discipline applied to external interfaces, especially when the model can trigger tools, retrieve records, or forward decisions into downstream automation.

Why It Matters in NHI Security

Multimodal AI matters because the attack surface extends beyond prompts into every ingestible channel that can influence an agent with execution authority. That creates direct governance implications for secrets, service accounts, and delegated access, especially when an AI system can read, summarise, and act on content that human reviewers would treat as untrusted. NHI Management Group research on secret exposure shows why speed matters: when AWS credentials are exposed publicly, attackers attempt access within an average of 17 minutes, and sometimes in as little as 9 minutes, as reported in LLMjacking: How Attackers Hijack AI Using Compromised NHIs. That urgency is even more dangerous when the compromised identity can be reached through a multimodal workflow that masks malicious instructions inside otherwise ordinary content. This is also why organisations should pair multimodal governance with NIST Cybersecurity Framework 2.0 practices for access control, monitoring, and response. Organisational resilience depends on recognising that the model is not just interpreting content, it may be operationalising it. Organisations typically encounter this consequence only after a malformed image, audio clip, or document causes an agent to misuse privileges, at which point multimodal AI becomes operationally unavoidable to address.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Agentic AI Top 10 address the attack and risk surface, while NIST AI RMF and NIST CSF 2.0 set the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
OWASP Agentic AI Top 10		Agentic systems ingest multimodal inputs that can steer tool use and execution.
NIST AI RMF		AI RMF covers multimodal risk, misuse, and governance across inputs and outputs.
NIST CSF 2.0	PR.AC-4	Access control is central when multimodal AI can trigger downstream actions.

Treat every supported modality as an untrusted instruction path before the agent can act.

#1 Authority in NHI Education, Research and Advisory, empowering organizations to tackle the critical risks posed by Non-Human Identities (NHIs), including AI Agents.

Multimodal AI

Expanded Definition

Examples and Use Cases

Why It Matters in NHI Security

Standards & Framework Alignment

Related resources from NHI Mgmt Group