Subscribe to the Non-Human & AI Identity Journal

Multimodal model

A multimodal model processes more than one input type, such as text and images, in the same system. That broader input surface increases the trust problem because the model may not naturally distinguish user data from attacker-controlled instructions unless the application adds explicit governance around those inputs.

Expanded Definition

A multimodal model combines more than one input type inside a single system, commonly text, images, audio, or video. In NHI security, the key issue is not simply that the model is “larger,” but that each modality becomes a possible instruction channel, data source, and attack surface at the same time.

Definitions vary across vendors about where the boundary sits between the base model, the orchestration layer, and the application that wraps it. NHI Management Group treats the term operationally: if a model can consume multiple untrusted inputs and produce tool-using outputs, it needs governance for every modality, not just the prompt field. That is why multimodal deployments should be evaluated alongside NIST Cybersecurity Framework 2.0 controls for access, monitoring, and response.

The most common misapplication is assuming image or document inputs are passive data, which occurs when developers fail to treat embedded text, OCR output, or hidden instructions as attacker-controlled content.

Examples and Use Cases

Implementing multimodal model governance rigorously often introduces latency and review overhead, requiring organisations to weigh richer automation against stricter inspection of each input channel.

  • Customer support agents that read screenshots and chat text need separate filtering for visible text, metadata, and any extracted instructions before the model is allowed to act.
  • Invoice processing systems that accept PDFs and line-item text should validate whether the document contains embedded prompts, malicious overlays, or manipulated OCR results.
  • Security copilots that analyse diagrams and incident notes must restrict tool access until the application confirms which modality produced the relevant assertion.
  • Productivity assistants that accept voice plus document uploads should preserve provenance so investigators can see whether a command originated in speech, text, or generated context.
  • For a broader NHI governance lens, the Ultimate Guide to NHIs shows how broad attack surfaces and poor visibility amplify risk across machine identities, while NIST Cybersecurity Framework 2.0 helps structure the controls that surround model inputs and outputs.

Multimodal systems are especially useful when the task requires context that no single input type can provide, such as comparing a screenshot to a ticket description or extracting meaning from a chart and an accompanying policy note.

Why It Matters in NHI Security

Multimodal models matter because every additional input path creates a new opportunity for prompt injection, data exfiltration, or unsafe tool invocation. NHI security teams need to think in terms of trust boundaries, because the model may process user-supplied content, internal documents, and external artifacts in one workflow without reliably distinguishing intent from payload.

That distinction is critical in environments where the model can retrieve secrets, call APIs, or create tickets on behalf of a user. NHI Management Group notes that 80% of identity breaches involved compromised non-human identities such as service accounts and API keys, and multimodal workflows can widen that exposure when model actions are not tightly scoped and audited. The Ultimate Guide to NHIs also shows how poor visibility and excessive privilege compound the problem across machine identities.

Practitioners should pair modality-specific filtering, least privilege, and output review with logging that preserves which input influenced which action. Organisational risk often becomes visible only after a malicious document, image, or transcript has already triggered an unintended API call, at which point multimodal model governance becomes operationally unavoidable to address.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Agentic AI Top 10 address the attack and risk surface, while NIST CSF 2.0 and NIST AI RMF set the governance and control requirements practitioners need to meet.

Framework Control / Reference Relevance
OWASP Agentic AI Top 10 A2 Covers prompt injection and unsafe tool use across AI workflows.
NIST CSF 2.0 PR.AC-4 Least-privilege access is essential when multimodal models can invoke actions.
NIST AI RMF Addresses AI risk identification and mitigation across model lifecycle stages.

Treat every modality as untrusted input and constrain tool execution behind policy checks.