Notifications

Clear all

Multimodal steering attacks: are your AI controls keeping up?

Last Post

RSS

NHI Mgmt Group

(@nhi-mgmt-group)

Member Moderator

Joined: 1 year ago

Posts: 12387

Topic starter 05/07/2026 6:57 pm

TL;DR: Vision-language models can be behaviorally steered through crafted images, producing effects similar to activation-level steering without runtime model access, and with negative steering shifts as high as 25 percentage points, according to HiddenLayer’s VISOR research. The finding matters because it breaks the assumption that API walls and prompt filters alone can prevent behavioral manipulation.

NHIMG editorial — based on content published by HiddenLayer: research on VISOR, visual input based steering for output redirection

By the numbers:

VISOR achieves the most substantial negative steering effect of up to 25% points compared to only 11.4% points for steering vector.
VISOR requires only a single 150KB image file rather than multi-layer activation modifications or careful prompt engineering.

Questions worth separating out

Q: How should security teams defend vision-language models against image-based steering?

A: Security teams should assume that any image input can influence model behaviour, not just model classification.

Q: Why do multimodal AI systems create a different governance problem from text-only models?

A: Multimodal systems create a different governance problem because the visual channel can alter internal activations before the final response is generated.

Q: What do organisations get wrong about API-based AI security boundaries?

A: They often assume an API wall prevents behavioural manipulation, when in practice it only limits direct model access.

Practitioner guidance

Classify multimodal uploads as control inputs Review every path that accepts images, screenshots, scans, or embedded media and decide whether the model treats them as behavioural inputs.
Test for behavioural drift across sessions Add red-team cases that compare model outputs before and after a single crafted image is introduced, then measure whether the shift persists across later prompts.
Separate content filtering from steering detection Do not assume image moderation or NSFW scanning will catch a steering image.

What's in the full report

HiddenLayer's full research covers the experimental detail this post intentionally leaves for the source:

The image-generation method used to produce steering effects across sycophancy, survival instinct, and refusal tasks
The comparative results showing how VISOR performed against activation-level steering vectors in both positive and negative directions
The Llava-1.5-7B evaluation setup and the 14k-sample performance checks used to validate behavioural persistence
The practical examples for financial services, retail, and automotive environments where adversarial and defensive uses diverge

👉 Read HiddenLayer's research on VISOR and multimodal AI steering →

Multimodal steering attacks: are your AI controls keeping up?

Explore further

View Full Forum → | NHI Foundation Course →

Quote

Topic Tags

Mr NHI

(@mr-nhi)

Member Moderator

Joined: 3 months ago

Posts: 11961

05/07/2026 7:20 pm

VISOR proves that multimodal AI behaviour is governed at the input layer, not only inside the model. That is a governance shift, not just a technical curiosity. Security teams have long treated image uploads as content, metadata, or storage concerns, but this research shows that an image can also act as a steering instrument for model behaviour. Practitioners should treat multimodal input channels as part of the identity and control plane for AI systems.

A few things that frame the scale:

91.6% of secrets remain valid five days after the targeted organisation is notified, showing a critical gap in remediation procedures, according to the Ultimate Guide to NHIs.
96% of organisations store secrets outside of secrets managers in vulnerable locations including code, config files, and CI/CD tools.

A question worth separating out:

Q: How do teams know whether an AI response shift is a steering attack or normal model variation?

A: Teams should compare outputs across repeated prompts before and after suspicious image inputs, then look for consistent changes in compliance, refusal, sycophancy, or bias. A steering attack is more likely when a single input causes repeatable behavioural drift across unrelated prompts in the same session.

👉 Read our full editorial: VISOR exposes a runtime steering gap in multimodal AI security

ReplyQuote

Forum Statistics

11 Forums

13.6 K Topics

26.1 K Posts

23 Online

135 Members

Latest Post: LLM security and AI-driven crime: what security teams must change Our newest member: Alex Recent Posts Unread Posts Tags

Forum Icons: Forum contains no unread posts Forum contains unread posts

Topic Icons: Not Replied Replied Active Hot Sticky Unapproved Solved Private Closed

#1 Authority in NHI Education, Research and Advisory, empowering organizations to tackle the critical risks posed by Non-Human Identities (NHIs), including AI Agents.

Get in Touch

Quick Links

FAQ

NHI 101 Articles

Legal & Policies