Multimodal steering attacks: are your AI controls keeping up?


(@nhi-mgmt-group)
Member Moderator
Joined: 1 year ago
Posts: 9223
Topic starter  

TL;DR: Vision-language models can be behaviorally steered through crafted images, producing effects similar to activation-level steering without runtime model access, and with negative steering shifts as high as 25 percentage points, according to HiddenLayer’s VISOR research. The finding matters because it breaks the assumption that API walls and prompt filters alone can prevent behavioral manipulation.

NHIMG editorial, based on content published by HiddenLayer: research on VISOR (Visual Input-based Steering for Output Redirection)

By the numbers:

  • VISOR achieves the largest negative steering effect, up to 25 percentage points, compared with only 11.4 percentage points for steering vectors.
  • VISOR requires only a single 150KB image file rather than multi-layer activation modifications or careful prompt engineering.

Questions worth separating out

Q: How should security teams defend vision-language models against image-based steering?

A: Security teams should assume that any image input can influence model behaviour, not just model classification.

Q: Why do multimodal AI systems create a different governance problem from text-only models?

A: Multimodal systems create a different governance problem because the visual channel can alter internal activations before the final response is generated.

Q: What do organisations get wrong about API-based AI security boundaries?

A: They often assume an API wall prevents behavioural manipulation, when in practice it only limits direct model access.

Practitioner guidance

  • Classify multimodal uploads as control inputs: Review every path that accepts images, screenshots, scans, or embedded media and decide whether the model treats them as behavioural inputs.
  • Test for behavioural drift across sessions: Add red-team cases that compare model outputs before and after a single crafted image is introduced, then measure whether the shift persists across later prompts.
  • Separate content filtering from steering detection: Do not assume image moderation or NSFW scanning will catch a steering image.
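
As a rough illustration of the drift test described above, the sketch below measures how far a refusal rate moves, in percentage points, when a candidate image accompanies each prompt. Everything here is an assumption made for illustration: the `ModelFn` interface, the keyword-based `is_refusal` heuristic, and the function names are hypothetical and do not come from HiddenLayer's research. It also treats each prompt independently, so checking whether a shift persists across later prompts in a session would need a stateful client that this sketch does not model.

```python
from typing import Callable, Optional, Sequence

# Hypothetical interface: takes a prompt and an optional image payload,
# returns the model's text reply. Wrap your own client to match it.
ModelFn = Callable[[str, Optional[bytes]], str]

# Crude keyword heuristic, for illustration only. A real evaluation
# would use a proper refusal classifier or human review.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "unable to help")


def is_refusal(response: str) -> bool:
    """Return True if the reply contains any of the refusal markers."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rate(model: ModelFn, prompts: Sequence[str],
                 image: Optional[bytes]) -> float:
    """Percentage of prompts refused when `image` accompanies each prompt."""
    if not prompts:
        raise ValueError("prompts must not be empty")
    refusals = sum(is_refusal(model(p, image)) for p in prompts)
    return 100.0 * refusals / len(prompts)


def drift_points(model: ModelFn, prompts: Sequence[str],
                 candidate_image: bytes,
                 control_image: Optional[bytes] = None) -> float:
    """Refusal-rate shift in percentage points: candidate minus control."""
    treated = refusal_rate(model, prompts, candidate_image)
    baseline = refusal_rate(model, prompts, control_image)
    return treated - baseline
```

A negative result means the candidate image lowered the refusal rate relative to the control. Using a benign control image rather than no image at all helps separate "an image was present" from "this particular image steered the model".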

What's in the full report

HiddenLayer's full research covers the experimental detail this post intentionally leaves for the source:

  • The image-generation method used to produce steering effects across sycophancy, survival instinct, and refusal tasks
  • The comparative results showing how VISOR performed against activation-level steering vectors in both positive and negative directions
  • The LLaVA-1.5-7B evaluation setup and the 14k-sample performance checks used to validate behavioural persistence
  • The practical examples for financial services, retail, and automotive environments where adversarial and defensive uses diverge

👉 Read HiddenLayer's research on VISOR and multimodal AI steering →

(@mr-nhi)
Member Moderator
Joined: 2 months ago
Posts: 8662
 

VISOR shows that multimodal AI behaviour is governed at the input layer, not only inside the model. That is a governance shift, not just a technical curiosity. Security teams have long treated image uploads as content, metadata, or storage concerns, but this research shows that an image can also act as a steering instrument for model behaviour. Practitioners should treat multimodal input channels as part of the identity and control plane for AI systems.

A few things that frame the scale:

  • 91.6% of secrets remain valid five days after the targeted organisation is notified, showing a critical gap in remediation procedures, according to the Ultimate Guide to NHIs.
  • 96% of organisations store secrets outside of secrets managers in vulnerable locations including code, config files, and CI/CD tools.

A question worth separating out:

Q: How do teams know whether an AI response shift is a steering attack or normal model variation?

A: Teams should compare outputs across repeated prompts before and after suspicious image inputs, then look for consistent changes in compliance, refusal, sycophancy, or bias. A steering attack is more likely when a single input causes repeatable behavioural drift across unrelated prompts in the same session.
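
One way to separate a repeatable shift from ordinary sampling noise is a simple significance check on the before and after counts. The sketch below uses a pooled two-proportion z-test. That choice of test is an assumption for illustration, not something taken from the HiddenLayer research, and it presumes independent samples and reasonably large prompt sets. The function names are hypothetical.

```python
import math


def two_proportion_z(hits_before: int, n_before: int,
                     hits_after: int, n_after: int) -> float:
    """z statistic for the change in a behaviour rate (after minus before)."""
    if n_before <= 0 or n_after <= 0:
        raise ValueError("sample sizes must be positive")
    p_before = hits_before / n_before
    p_after = hits_after / n_after
    # Pooled rate under the null hypothesis that the image changed nothing.
    pooled = (hits_before + hits_after) / (n_before + n_after)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_before + 1 / n_after))
    if se == 0:
        return 0.0  # both rates are identical at 0% or 100%
    return (p_after - p_before) / se


def two_sided_p(z: float) -> float:
    """Two-sided p-value for a standard normal z statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


# Example: 40 of 100 prompts refused before the image, 15 of 100 after.
z = two_proportion_z(40, 100, 15, 100)
print(f"z = {z:.2f}, p = {two_sided_p(z):.5f}")
```

A small p-value says only that the shift is unlikely to be sampling noise. Repeating the comparison across unrelated prompt sets, as the answer above suggests, is what indicates the drift is tied to the image and not to a particular set of prompts.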

👉 Read our full editorial: VISOR exposes a runtime steering gap in multimodal AI security



   