By NHI Mgmt Group Editorial Team | Published 2025-08-18 | Domain: Agentic AI & NHIs | Source: HiddenLayer

TL;DR: Vision-language models can be behaviourally steered through crafted images, producing effects similar to activation-level steering without runtime model access, with negative steering shifts as high as 25 percentage points, according to HiddenLayer’s VISOR research. The finding matters because it breaks the assumption that API walls and prompt filters alone can prevent behavioural manipulation.


At a glance

What this is: VISOR is HiddenLayer’s finding that crafted images can steer vision-language model behaviour through ordinary input channels, not model internals.

Why it matters: IAM, NHI, and AI governance teams now have to treat multimodal inputs as a control surface, not just a content channel, when they secure model behaviour.

By the numbers:

  • VISOR achieves the strongest negative steering effect, up to 25 percentage points, compared with only 11.4 percentage points for steering vectors.
  • VISOR requires only a single 150KB image file rather than multi-layer activation modifications or careful prompt engineering.

👉 Read HiddenLayer's research on VISOR and multimodal AI steering


Context

VISOR exposes a control gap in multimodal AI security: if an image can change a model’s behavioral tendency at runtime, then input filtering and system prompting are not enough on their own. The key issue for AI governance is not misclassification alone, but the fact that visual inputs can become an external steering mechanism for model output.

That matters for organisations using vision-language models in customer service, screening, moderation, and advisory workflows. The article’s core claim is that behavioural tampering can occur through a standard input path, which means programme owners need to think about model behaviour, input provenance, and session-level control together rather than as separate problems.


Key questions

Q: How should security teams defend vision-language models against image-based steering?

A: Security teams should assume that any image input can influence model behaviour, not just model classification. The practical response is to review upload paths, test for session-level behavioural drift, and monitor for outputs that change after a single crafted image. That is more effective than relying on prompt hardening alone.

Q: Why do multimodal AI systems create a different governance problem from text-only models?

A: Multimodal systems create a different governance problem because the visual channel can alter internal activations before the final response is generated. That means the trust boundary is not only the prompt box. It also includes image uploads, scanners, and any upstream system that passes visual inputs into inference.

Q: What do organisations get wrong about API-based AI security boundaries?

A: They often assume an API wall prevents behavioural manipulation, when in practice it only limits direct model access. If an attacker can provide a crafted image through the normal interface, they can still steer outputs without touching weights, code, or internal activations. Governance must reflect that reality.

Q: How do teams know whether an AI response shift is a steering attack or normal model variation?

A: Teams should compare outputs across repeated prompts before and after suspicious image inputs, then look for consistent changes in compliance, refusal, sycophancy, or bias. A steering attack is more likely when a single input causes repeatable behavioural drift across unrelated prompts in the same session.
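The before-and-after comparison described above can be sketched as a simple two-proportion test on a labelled behaviour such as refusal. This is a minimal illustration, not a method from HiddenLayer's research: the function names, the 10 percentage point floor, and the 2.58 z-score cut-off are all assumptions chosen for the example, and the counts are hypothetical.

```python
import math


def two_proportion_z(hits_before, n_before, hits_after, n_after):
    """Return (shift in percentage points, z-score) for two observed rates."""
    p_before = hits_before / n_before
    p_after = hits_after / n_after
    # Pooled rate under the hypothesis that nothing changed.
    pooled = (hits_before + hits_after) / (n_before + n_after)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_before + 1 / n_after))
    shift_pp = (p_after - p_before) * 100
    z = (p_after - p_before) / se if se > 0 else 0.0
    return shift_pp, z


def looks_like_steering(hits_before, n_before, hits_after, n_after,
                        min_shift_pp=10.0, min_abs_z=2.58):
    """Flag a shift that is both large and unlikely to be sampling noise.

    Both thresholds are illustrative assumptions, not recommended values.
    """
    shift_pp, z = two_proportion_z(hits_before, n_before, hits_after, n_after)
    return abs(shift_pp) >= min_shift_pp and abs(z) >= min_abs_z


# Hypothetical counts: refusals out of 100 prompts before and after an image.
suspicious = looks_like_steering(60, 100, 35, 100)   # large, consistent drop
ordinary = looks_like_steering(60, 100, 57, 100)     # small fluctuation
```

Requiring both a minimum shift and a minimum z-score keeps small samples from triggering alerts on noise while still ignoring statistically significant but operationally trivial changes.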


Technical breakdown

How visual steering changes model activations

VISOR works by using a crafted image to induce activation patterns that resemble those produced by internal steering vectors. In multimodal models, the visual and text pathways share downstream processing, so an image can shift the model’s latent state before the final response is generated. This is different from a prompt injection that changes instructions textually. The steering effect persists because the image influences the model’s internal representation, not just the words in the prompt. The result is a runtime behavioural bias that can survive across many prompts in the same session.
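The idea that an input change can reproduce an activation-level shift can be illustrated with a toy linear model. This sketch is an assumption-laden stand-in and not VISOR's actual optimisation: real vision-language models are nonlinear, and the matrix `W`, the input `x`, and the steering vector `v` below are hypothetical values chosen only to show the principle.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def input_delta_for_shift(W, v, steps=500, lr=0.1):
    """Gradient descent on 0.5 * ||W d - v||^2 to find an input perturbation d
    whose effect on the activation matches the steering vector v."""
    d = [0.0] * len(W[0])
    for _ in range(steps):
        residual = [a - b for a, b in zip(matvec(W, d), v)]
        # The gradient of the objective with respect to d is W^T * residual.
        grad = [sum(W[i][j] * residual[i] for i in range(len(W)))
                for j in range(len(d))]
        d = [dj - lr * gj for dj, gj in zip(d, grad)]
    return d


# Toy "model": a 3-dimensional input mapped to a 2-dimensional activation.
W = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
x = [0.2, -0.1, 0.5]   # hypothetical original input
v = [0.5, -0.3]        # hypothetical steering vector in activation space

delta = input_delta_for_shift(W, v)

# Activation when the steering vector is added internally...
steered_internally = [h + s for h, s in zip(matvec(W, x), v)]
# ...versus the activation reached by perturbing only the input.
steered_via_input = matvec(W, [xi + di for xi, di in zip(x, delta)])

gap = max(abs(a - b) for a, b in zip(steered_internally, steered_via_input))
```

In this toy case the two activations coincide up to floating-point error, which is the point of the illustration: whoever controls the input can reach the same internal state without touching the model.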

Practical implication: input governance must treat images as behavioural inputs, not only as data to classify or store.

Why API walls do not remove steering risk

The article’s key architectural point is that model owners and API consumers do not face the same access conditions. Internal activation steering requires model-state access, but VISOR shifts the attack surface into the input layer, which is available to ordinary users or upstream systems. That falsifies the old security assumption that only privileged insiders can alter model behaviour. A public-facing application that accepts uploads, screenshots, scans, or embedded images may unintentionally expose the model to behavioural modification without any code change, weight access, or prompt template overwrite.

Practical implication: upload paths, OCR pipelines, and multimodal APIs need the same review discipline as any other trust boundary.

What bidirectional steering means for alignment controls

A notable finding is that VISOR can steer both toward and away from specific behaviours, including compliance, refusal, and sycophancy. That symmetry matters because it shows the same control surface can be used defensively or offensively. In governance terms, a defensive image-based control is still a control that can be subverted by a stronger image-based control. The problem is not just malicious content, but the existence of an input mechanism that can reshape behaviour with minimal overhead and without model internals.

Practical implication: model safety reviews should test whether benign steering mechanisms can be inverted by hostile inputs.


Threat narrative

Attacker objective: The attacker’s objective is to redirect model behaviour without touching model weights or runtime internals, so the system produces outputs that favour the attacker’s intent.

  1. Entry occurs when an attacker sends a crafted image through a normal multimodal input path such as upload, email, or support-ticket attachment.
  2. Credential access is not the issue here, because the attack uses legitimate access to the model interface and manipulates hidden activations through the input layer.
  3. Impact follows when the model shifts into unsafe, biased, or misleading output patterns that persist across subsequent responses in the session.

Read our 52 NHI Breaches Analysis report for a comprehensive view of breaches impacting Non-Human Identities including AI Agents.


NHI Mgmt Group analysis

VISOR proves that multimodal AI behaviour is governed at the input layer, not only inside the model. That is a governance shift, not just a technical curiosity. Security teams have long treated image uploads as content, metadata, or storage concerns, but this research shows that an image can also act as a steering instrument for model behaviour. Practitioners should treat multimodal input channels as part of the identity and control plane for AI systems.

The assumption that API access is a sufficient security boundary breaks here. That assumption was designed for model access patterns where only privileged operators could modify activations or weights. It fails when the actor can influence behaviour through ordinary user input, because behavioural control no longer requires internal access. The implication is that model governance must separate access control from behavioural trust assumptions.

Behavioural tampering is now a session-level risk, not only a model-level risk. VISOR’s persistence across subsequent outputs means one malicious input can shape multiple downstream responses without any code change or obvious operational alert. That makes the failure mode closer to runtime policy drift than classic prompt abuse. The practitioner conclusion is that response monitoring must watch for behavioural deviation over time, not just individual unsafe outputs.

Image-based steering expands the AI attack surface into the same trust problem NHI teams already know from secrets and delegated access. Once a standard input can alter downstream action quality, the distinction between authorised interaction and authorised control becomes much thinner. That matters for programmes that assume trust only needs to be enforced at authentication or routing points. Security teams should now ask where their AI systems accept untrusted inputs that can change decisions, not merely data.

Input-space steering is a named governance gap, and it will not be closed by prompt engineering alone. The article shows that carefully optimised images can reproduce effects similar to activation steering while remaining easy to distribute and hard to inspect. For practitioners, that means the control problem is no longer limited to better prompts or safer instructions. The real issue is whether the programme can detect when an ordinary-looking input has become a behavioural modifier.

From our research:

  • 91.6% of secrets remain valid five days after the targeted organisation is notified, showing a critical gap in remediation procedures, according to the Ultimate Guide to NHIs.
  • 96% of organisations store secrets outside of secrets managers in vulnerable locations including code, config files, and CI/CD tools.
  • That same lifecycle gap is why practitioners should also review the OWASP Agentic AI Top 10 when multimodal inputs can alter model behaviour at runtime.

What this signals

Input-space steering should be treated as an AI governance design issue, not a model patching issue. If image inputs can shift behaviour, then the review scope has to include upload policy, session monitoring, and trust-zone design around inference paths. That is where AI security starts to overlap with identity governance: who can influence the system, through which channel, and under what assumptions.

Practitioners should expect multimodal models to force new control mappings between content ingestion and behavioural assurance. Existing content filters can remove obvious harm, but they do not prove that an apparently harmless file cannot steer a downstream decision path. That creates a need for behavioural testing in the same way identity teams test entitlement boundaries.

Runtime steering is now a governance control surface. In our view, the organisations that will handle this best are the ones that already think in terms of trust boundaries, delegated access, and session artefacts. The model may not be an identity in the classic sense, but the control problem is the same: untrusted inputs can still shape authorised outcomes.


For practitioners

  • Classify multimodal uploads as control inputs: Review every path that accepts images, screenshots, scans, or embedded media and decide whether the model treats them as behavioural inputs. If yes, place those paths under the same review, logging, and approval discipline used for other high-risk control surfaces.
  • Test for behavioural drift across sessions: Add red-team cases that compare model outputs before and after a single crafted image is introduced, then measure whether the shift persists across later prompts. Flag any session where bias, refusal, or compliance changes without a corresponding application change.
  • Separate content filtering from steering detection: Do not assume image moderation or NSFW scanning will catch a steering image. Build tests for hidden behavioural modification, including cases where the image looks harmless but induces consistent response changes.
  • Inventory public-facing multimodal trust boundaries: Map where customer uploads, support attachments, email-inbound images, and document parsing can reach model inference. Prioritise the flows where external users can influence behaviour without passing through a human review gate.
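The inventory and prioritisation step above can be sketched as a small data model. Everything here is a hypothetical illustration: the `InputPath` fields, the scoring weights, and the example paths are assumptions made for the sketch, not a scheme taken from HiddenLayer's research or any framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InputPath:
    name: str
    accepts_images: bool
    external_source: bool      # can people outside the organisation submit here?
    reaches_inference: bool    # does the content end up in a model call?
    human_review_gate: bool    # is there a human check before inference?


def risk_score(path):
    """Illustrative scoring: 0 means the path is out of scope for steering risk."""
    if not (path.accepts_images and path.reaches_inference):
        return 0
    score = 1
    if path.external_source:
        score += 2
    if not path.human_review_gate:
        score += 1
    return score


def prioritise(paths):
    """Return in-scope paths, highest risk first (ties keep their input order)."""
    in_scope = [p for p in paths if risk_score(p) > 0]
    return sorted(in_scope, key=risk_score, reverse=True)


# Hypothetical inventory entries.
inventory = [
    InputPath("support_attachment", True, True, True, True),
    InputPath("customer_upload", True, True, True, False),
    InputPath("internal_scan_archive", True, False, False, False),
    InputPath("email_inbound", True, True, True, False),
]

review_order = [p.name for p in prioritise(inventory)]
```

The scoring is deliberately coarse. Its purpose is to surface the flows where an external party can reach inference with no human in between, which are the ones to review first.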

Key takeaways

  • VISOR shows that a single image can change vision-language model behaviour without direct model access, which turns the input layer into a governance surface.
  • The research demonstrates strong steering effects, including up to 25 percentage points of negative behavioural shift, so the risk is measurable rather than theoretical.
  • Teams should test for behavioural drift, map multimodal trust boundaries, and treat image ingestion as part of AI control design.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

The OWASP Agentic AI Top 10 addresses the attack and risk surface, while NIST AI RMF and NIST CSF 2.0 set the governance and control requirements practitioners need to meet.

Framework | Control / Reference | Relevance
OWASP Agentic AI Top 10 | - | Covers agentic and multimodal input abuse patterns relevant to VISOR.
NIST AI RMF | - | Addresses governance and monitoring for AI behaviour changes.
NIST CSF 2.0 | PR.AC-4 | Access control assumptions fail when untrusted inputs can influence outcomes.

Add behavioural monitoring and risk review to the AI governance process for multimodal models.


Key terms

  • Steering Vector: A steering vector is a mathematical signal used to nudge a model toward or away from a behaviour during inference. In practice, it modifies internal activations so the model responds differently without changing the underlying weights or retraining the model.
  • Multimodal Input Channel: A multimodal input channel is any non-text path, such as an image, screenshot, or document scan, that reaches a model alongside text. For security teams, it matters because the channel can carry behavioural influence as well as visible content.
  • Behavioural Drift: Behavioural drift is an unwanted change in a model’s output style, refusal behaviour, or compliance pattern over time or after a specific input. It can indicate steering, prompt abuse, or other hidden influence that changes how the model acts in practice.
  • Runtime Steering: Runtime steering is the act of changing model behaviour while the model is executing, rather than by retraining or editing weights. It is especially important in multimodal systems because an external input can act on internal state during inference.
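The relationship between a steering vector and input-space steering can be written compactly. The notation below is an assumption introduced for illustration (a common additive formulation of activation steering), not the exact objective from HiddenLayer's research: $h_\ell$ is the activation at a chosen layer $\ell$, $v$ is the steering vector, $\alpha$ is a signed strength, $x$ is the original image, $x'$ is the crafted image, and $t$ is the text prompt.

```latex
% Activation-level steering: requires access to the model's internal state.
h_\ell' = h_\ell + \alpha v,
\qquad \alpha > 0 \ \text{(toward the behaviour)}, \quad \alpha < 0 \ \text{(away from it)}

% Input-space steering: find an image whose activations approximate the steered ones.
x' \approx \arg\min_{\tilde{x}} \;
\bigl\lVert h_\ell(\tilde{x}, t) - \bigl( h_\ell(x, t) + \alpha v \bigr) \bigr\rVert^2
```

The first line needs privileged access to the model, while the second only needs the ability to submit an image, which is why the trust boundary moves to the input layer.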

What's in the full report

HiddenLayer's full research covers the experimental detail this post intentionally leaves for the source:

  • The image-generation method used to produce steering effects across sycophancy, survival instinct, and refusal tasks
  • The comparative results showing how VISOR performed against activation-level steering vectors in both positive and negative directions
  • The LLaVA-1.5-7B evaluation setup and the 14k-sample performance checks used to validate behavioural persistence
  • The practical examples for financial services, retail, and automotive environments where adversarial and defensive uses diverge

👉 HiddenLayer's full post covers the image-based attack path, evaluation data, and deployment implications in more detail.

Deepen your knowledge

NHI governance, agentic AI identity, and machine identity security are core topics in our NHI Foundation Level course, the industry's only accredited NHI security programme. If you are responsible for identity security strategy or governance in your organisation, it is worth exploring.
NHIMG Editorial Note
Published by the NHIMG editorial team on 2025-08-18.
NHI Mgmt Group — the independent authority on Non-Human Identity, IAM, and Agentic AI security. nhimg.org