Visual prompt injections expose a new control gap in multimodal AI

By NHI Mgmt Group Editorial TeamPublished 2026-04-20Domain: Agentic AI & NHIsSource: Lakera

TL;DR: Visual prompt injections embed malicious instructions inside images so multimodal models can be induced to ignore their original task, change outputs, or suppress nearby content, according to Lakera’s analysis. The governance gap is no longer just prompt hygiene, but control over what an AI system can do after it has already seen the image.

At a glance

What this is: This is an analysis of visual prompt injection, a multimodal AI attack that hides instructions inside images to steer model behaviour.

Why it matters: It matters because IAM and security teams now have to govern not just what AI systems can access, but how they can be manipulated after access is granted.

👉 Read Lakera's analysis of visual prompt injection in multimodal AI

Context

Visual prompt injection is a form of prompt abuse where malicious instructions are embedded in an image rather than typed into a chat box. Once a multimodal model reads that image, the attacker is no longer relying on direct access to the model prompt, but on the model's tendency to follow instructions it was never meant to trust.

For identity and access programmes, the important shift is that the control problem moves beyond authentication and basic authorisation. Multimodal AI can be tricked into producing outputs or taking actions that look legitimate unless teams treat image input, downstream tool use, and execution boundaries as part of the same governance problem.

That makes this topic relevant to both agentic AI security and broader NHI governance. The article's examples show why security teams must assess the execution layer, not just the input layer, when AI systems operate across enterprise workflows.

Key questions

Q: How should security teams test for visual prompt injection in multimodal AI systems?

A: Test the full input path, not just the model output. Use adversarial images with hidden text, overlays, and instruction-like captions, then verify whether the model changes its answer, ignores visible subjects, or triggers downstream actions. The goal is to prove that malicious image content cannot steer protected behaviour.

Q: Why do multimodal AI systems create new governance risks for identity teams?

A: Because the system can be reached legitimately and still be manipulated after access is granted. Identity controls may authenticate the user or service, but they do not stop attacker instructions buried in images from altering what the model says or does. That makes execution boundaries and content trust part of governance.

Q: What breaks when image inputs are allowed to influence tool use in AI workflows?

A: The application loses a reliable separation between perception and action. If a model can turn untrusted image content into search, retrieval, messaging, or suppression behaviour, a visual prompt injection can become an operational incident instead of a harmless output error. That is where approval gates and bounded permissions matter.

Q: How can organisations reduce the impact of prompt injections without blocking multimodal use?

A: Limit the model's authority. Keep image interpretation separate from privileged actions, log the chain from input to decision, and require human approval for any step that changes records, sends messages, or accesses sensitive data. The objective is to preserve multimodal capability while preventing untrusted content from becoming execution.

Technical breakdown

How visual prompt injection works inside multimodal models

A visual prompt injection places attacker instructions inside an image so the model encounters them while performing image understanding. In a multimodal system, the image is not just data, it becomes an instruction carrier once the model parses the visual content into tokens or latent representations that influence generation. This is especially risky when the model is expected to describe, classify, or extract actions from user-supplied images. The issue is not simple malware delivery. It is instruction collision, where attacker content competes with system prompts and task prompts in the same inference path.

Practical implication: separate trusted system instructions from untrusted visual content and treat image interpretation as an adversarial input path.

Why image inputs can override intended model behaviour

The article's examples show that carefully placed text in an image can alter what the model says, what it notices, and even whether it acknowledges a person at all. That happens because the model has no inherent concept of provenance for instructions embedded in pixels. If the model cannot distinguish user content from attacker intent, it may follow the most salient instruction it detects rather than the one the application intended. In practice, this creates a boundary failure between perception and policy. The model is doing exactly what it was trained to do, but the application is assuming the input is benign.

Practical implication: add content inspection and policy enforcement before multimodal outputs can influence downstream decisions or user-visible actions.

What changes when visual inputs reach agentic AI systems

The risk increases when a multimodal model is connected to tools, workflows, or autonomous action chains. At that point, a visual prompt injection is no longer just an output-quality issue. It can become an execution issue if the model is allowed to call tools, suppress competing content, or trigger follow-on actions based on manipulated image interpretation. This is where AI governance overlaps with NHI control thinking: the system needs bounded permissions, explicit approval gates, and observable execution paths. Without those, a crafted image can steer behaviour in ways the business never intended to authorise.

Practical implication: restrict tool access and approval-free actions for multimodal systems that can consume untrusted images.

NHI Mgmt Group analysis

Visual prompt injection is an execution-layer problem, not just a prompt problem. The article shows that malicious instructions hidden in images can change model output without changing the application code or the user-facing prompt. That means the trust boundary is inside the inference path, where untrusted content and system intent are already colliding. Practitioners should treat multimodal input as a governed execution surface, not a harmless content type.

Multimodal AI creates a control gap that classic IAM was never designed to cover. Identity controls can answer who may reach the system, but they do not answer whether the system can be manipulated after access is granted. The article's invisibility-cloak and advert examples show how policy can be bypassed through the perception layer. The implication is that access governance must expand to include content provenance, instruction separation, and action boundaries.

Agentic AI increases the blast radius of visual prompt injection. When a model can retrieve data or invoke tools, an image-based instruction becomes a possible trigger for unintended runtime behaviour. That is a different risk from chat prompt abuse because the harm is not limited to bad text generation. Security teams need to assume that any untrusted visual input may attempt to influence subsequent execution decisions.

Prompt injection resilience should be measured by control containment, not by model compliance claims. A system is not safe because it usually ignores bad instructions. It is safe only if malicious visual content cannot alter protected instructions, invoke privileged actions, or change what the application considers authoritative. Practitioners should focus on whether the model can be steered into actions that cross policy boundaries.

Visual prompt injection exposes the runtime governance gap in agentic AI security. This is the point where content moderation alone fails and execution governance begins. Teams should be asking where image-derived instructions can reach tools, records, or decisions, because that is where a benign-looking input becomes an operational incident.

From our research:
Only 44% of developers are reported to follow security best practices for secrets management, exposing a significant developer behaviour gap, according to The State of Secrets in AppSec.
The average estimated time to remediate a leaked secret is 27 days, despite 75% of organisations expressing strong confidence in their secrets management capabilities.
For a broader identity governance view, see NHI Lifecycle Management Guide for lifecycle controls that help reduce persistent access and unmanaged exposure.

What this signals

Visual prompt injection should be treated as a governance signal, not a novelty attack. As multimodal systems move into customer support, internal search, and workflow automation, the question becomes whether image-derived instructions can ever reach privileged actions. Teams that already struggle with secret handling can expect the same discipline gap to reappear in AI control paths, especially where decision traces are weak.

The practical boundary is the execution layer, and that is where identity programmes need to focus their next controls. The more an AI system can retrieve data or invoke tools, the more important it becomes to define what untrusted content may influence, and what it never should. That is a direct fit with the broader shift toward Top 10 NHI Issues and disciplined access containment.

Multimodal instruction resistance: this is the control objective that emerges from the article's examples. If a model can be told by an image to ignore a person, suppress an advert, or answer a different question, then the programme needs assurance that perception does not override policy. That is the same governance logic now showing up across agentic AI and NHI workflows.

For practitioners

Separate untrusted image content from protected instructions Place multimodal prompt construction behind a trust boundary so image-derived text cannot be treated as system or developer instructions. Review how captions, OCR output, and metadata are merged before the model sees them.
Constrain tool use for multimodal models Limit which actions a model can trigger after reading an image, especially when those actions involve search, retrieval, messaging, or content suppression. Require explicit approval for any step that crosses a policy boundary.
Add adversarial image testing to red-team exercises Test whether off-white text, hidden instructions, overlays, and caption-like artefacts can steer outputs or suppress recognition. Include both benign failure cases and tool-connected scenarios in the same test plan.
Instrument decision traces for multimodal outputs Log the image source, the extracted text, the model response, and any downstream action so investigators can see where the model followed attacker content instead of intended policy.

Key takeaways

Visual prompt injection turns images into instruction carriers, which means the model's trust boundary is wider than the text prompt alone.
The evidence in the article shows that hidden instructions can alter recognition, outputs, and suppression behaviour, which is enough to justify execution-layer controls.
Security teams should test multimodal systems for adversarial image steering, restrict tool authority, and separate untrusted content from privileged action paths.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Agentic AI Top 10 address the attack and risk surface, while NIST AI RMF and NIST Zero Trust (SP 800-207) set the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
OWASP Agentic AI Top 10		Visual prompt injection is an agentic AI input manipulation issue.
NIST AI RMF		AI RMF addresses governance and measurement for manipulated AI behaviour.
NIST Zero Trust (SP 800-207)	PR.AC-4	The attack shows access alone is not enough; actions must still be constrained.

Define accountable owners and test whether prompt injection can alter model decisions or downstream actions.

Key terms

Visual prompt injection: A visual prompt injection is an attack where malicious instructions are embedded in an image so a multimodal model follows them during interpretation. The risk is not limited to bad text output. It can also influence recognition, suppression, or downstream action when the system treats visual content as instruction-bearing input.
Multimodal model: A multimodal model processes more than one input type, such as text and images, in the same system. That broader input surface increases the trust problem because the model may not naturally distinguish user data from attacker-controlled instructions unless the application adds explicit governance around those inputs.
Execution layer: The execution layer is the part of an AI system where model output turns into an action, such as a tool call, search request, record change, or message. In security terms, this is where a prompt injection becomes operationally dangerous, because influence over the model can translate into influence over business behaviour.
Tool authority: Tool authority is the set of permissions an AI system has to reach other systems, call functions, or change state. For multimodal and agentic AI, the main risk is not only what the model can say, but what it can cause to happen once untrusted input steers its next action.

What's in the full article

Lakera's full article covers the operational detail this post intentionally leaves for the source:

Worked examples of invisibility-cloak style injections and how the model responded
Image-based prompt injection variants that change identity recognition and content suppression
Defence ideas for multimodal systems, including the visual prompt injection detector the vendor is building
Related reading links on prompt injection, content moderation, and AI red teaming

👉 Lakera's full article shows the example images, attack variations, and defence discussion in more detail.

Deepen your knowledge

NHI governance, agentic AI identity, and machine identity security are core topics in our NHI Foundation Level course, the industry's only accredited NHI security programme. If you are building or maturing an identity security programme, it is worth exploring.

NHIMG Editorial Note
Published by the NHIMG editorial team on 2026-04-20.
NHI Mgmt Group — the independent authority on Non-Human Identity, IAM, and Agentic AI security. nhimg.org