By NHI Mgmt Group Editorial Team | Published 2025-08-26 | Domain: Agentic AI & NHIs | Source: HiddenLayer

TL;DR: ShadowLogic backdoors can be embedded in trusted model formats such as ONNX and TensorRT, survive conversion, and remain effective even after downstream fine-tuning, according to HiddenLayer, which makes model supply chain trust harder to justify. Persistent logic inside the graph turns model provenance into an access-control problem, not just a model-quality issue.


At a glance

What this is: A research post on ShadowLogic backdoors in AI model supply chains, showing that malicious logic can persist through model conversion and downstream fine-tuning.

Why it matters: IAM, NHI, and AI governance teams must treat model provenance, deployment trust, and toolchain integrity as part of identity and access risk, not just ML engineering hygiene.

👉 Read HiddenLayer's research on persistent ShadowLogic backdoors in AI models


Context

ShadowLogic is a form of model tampering in which attacker-defined logic is inserted into a machine learning model so the model behaves normally until a trigger appears. For practitioners, the governance problem is not only model accuracy. It is whether a supposedly trusted model artifact can carry persistent hidden behaviour across packaging, conversion, and deployment.

This matters for AI-enabled security systems, but also for the broader identity stack around them. Once a model can be modified to change decisions after it is approved, model trust starts to resemble privileged access governance: you need to know who can alter the artifact, how that change is reviewed, and whether downstream systems can detect the difference.


Key questions

Q: How should security teams govern machine learning models that may contain hidden backdoors?

A: Security teams should govern machine learning models as controlled artifacts, not as passive files. That means validating provenance, inspecting exported graphs, tracking custody across conversion steps, and testing for malicious trigger behaviour before deployment. If a model can influence security or access decisions, hidden logic in the artifact becomes a governance issue, not only an ML quality issue.

Q: Why is model conversion risky when the source artifact may be tampered with?

A: Conversion is risky because it preserves structure, not trust. If malicious logic already exists in the source model, exporting it to ONNX, TensorRT, or another runtime format can carry the backdoor forward intact. Teams should assume the conversion pipeline can faithfully preserve both legitimate functionality and hidden control flow unless they verify the artifact separately.

Q: What do security teams get wrong about fine-tuning compromised models?

A: They often assume fine-tuning will wash out prior compromise, but that only applies when the issue lives in the learned weights. A graph-level backdoor can persist after additional training because the hidden logic still exists in the deployed artifact. Fine-tuning may improve accuracy while leaving the attack path untouched.

Q: How can organisations reduce the risk of malicious model supply chain attacks?

A: Organisations should combine provenance checks, artifact signing, graph inspection, and adversarial testing before models reach production. If a model is sourced externally or converted between formats, every transition should be treated as a new trust boundary. The aim is to prove what the model is, not just whether it performs acceptably on clean data.


Technical breakdown

How ShadowLogic hides malicious logic inside model graphs

ShadowLogic works by modifying the computational graph itself rather than relying on retraining alone. In practice, the attacker inserts conditional branches that watch for a trigger pattern and then substitute attacker-chosen outputs. Because the malicious logic becomes part of the exported graph, the model can still look structurally valid in formats such as ONNX or TensorRT while carrying hidden behaviour. This differs from simple data poisoning, where the backdoor often depends on learned weights and can be weakened by later retraining.

Practical implication: Treat graph-level inspection as part of artifact approval, not just accuracy testing. Inspect model graphs and exported artifacts for unexpected branches before approving them for production use.
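
As a rough illustration of what a first-pass graph review might look like, the sketch below flags operators that introduce branching or comparison logic but are not expected for the approved architecture. It is a minimal sketch under stated assumptions: the `Node` record, the `SUSPICIOUS_OPS` watch list, and the sample node dump are all hypothetical, and the code works on plain records rather than a real model file. With the `onnx` package, comparable records could be built from `model.graph.node`, whose entries expose `name` and `op_type`.

```python
from dataclasses import dataclass

# Operator types that introduce branching or comparison logic. This watch list
# is an illustrative assumption, not an exhaustive or authoritative signature set.
SUSPICIOUS_OPS = {"If", "Loop", "Where", "Equal", "Greater", "Less"}


@dataclass(frozen=True)
class Node:
    name: str
    op_type: str


def flag_unexpected_ops(nodes, approved_ops):
    """Return nodes whose operator is on the watch list but is not among the
    operators approved for this model architecture."""
    return [
        n for n in nodes
        if n.op_type in SUSPICIOUS_OPS and n.op_type not in approved_ops
    ]


# Hypothetical node dump standing in for an exported model graph.
exported = [
    Node("conv1", "Conv"),
    Node("relu1", "Relu"),
    Node("trigger_check", "Equal"),
    Node("swap_output", "Where"),
    Node("fc", "Gemm"),
]
# Operators the reviewer expects for this architecture (also an assumption).
approved = {"Conv", "Relu", "Gemm", "Softmax"}

flagged = flag_unexpected_ops(exported, approved)
for node in flagged:
    print(f"review: {node.name} ({node.op_type})")
```

Many legitimate models use comparison and selection operators, so a flag here is a prompt for human review rather than a verdict. A real check would also need to walk nested subgraphs, which this sketch does not attempt.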

Why model conversion does not remove the backdoor

A key property of ShadowLogic is persistence across conversion pipelines. When a model is moved from a training framework into production formats, the malicious control flow can be preserved because the conversion process translates the graph, not the intent behind it. That makes “safe” deployment formats only partially safe if the source artifact has already been altered. In identity terms, this is a supply chain trust problem: the system assumes the imported object is what it claims to be.

Practical implication: Verify provenance and run artifact-level checks at every model packaging and conversion boundary, not just at the original source repository.
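
One way to picture custody tracking across conversion boundaries is a simple hash ledger, sketched below. This is an illustrative sketch, not a description of any specific tool: the ledger layout, the step names, and the byte strings standing in for model files are all assumptions. Each step records a SHA-256 digest of what it consumed and what it produced, so a gap in the chain shows where an artifact changed outside the recorded pipeline.

```python
import hashlib


def digest(data: bytes) -> str:
    """SHA-256 hex digest of an artifact's bytes."""
    return hashlib.sha256(data).hexdigest()


def record_step(ledger, tool, input_bytes, output_bytes):
    """Append one custody record: which tool ran, what went in, what came out."""
    ledger.append({
        "tool": tool,
        "input": digest(input_bytes),
        "output": digest(output_bytes),
    })


def chain_is_intact(ledger, deployed_bytes):
    """Each step must consume exactly what the previous step produced, and the
    deployed artifact must match the last recorded output."""
    for prev, cur in zip(ledger, ledger[1:]):
        if prev["output"] != cur["input"]:
            return False
    return bool(ledger) and ledger[-1]["output"] == digest(deployed_bytes)


# Placeholder byte strings standing in for real model files (assumptions).
trained = b"trained-checkpoint"
exported = b"onnx-export"
engine = b"runtime-engine"

ledger = []
record_step(ledger, "export", trained, exported)
record_step(ledger, "build-engine", exported, engine)

print(chain_is_intact(ledger, engine))
print(chain_is_intact(ledger, b"runtime-engine-modified"))
```

A ledger like this only shows that nothing changed between recorded steps. It says nothing about whether the source artifact was clean to begin with, which is why it complements graph inspection rather than replacing it.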

Why downstream fine-tuning may not remove a graph backdoor

Fine-tuning changes model behaviour by adjusting weights on new data, but it does not necessarily rewrite or erase malicious graph logic embedded before deployment. HiddenLayer’s example shows that a graph backdoor can survive later clean fine-tuning even when a conventional backdoor weakens. That matters because many organisations assume retraining or tuning is a remediation step. It is not if the malicious path still exists in the deployed artifact.

Practical implication: Do not rely on additional fine-tuning to cleanse a model artifact that may already contain hidden control flow.
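
The reasoning can be made concrete with a toy model. The sketch below is not HiddenLayer's implementation and does not reproduce their experiment; the trigger value, the two-feature linear classifier, and the perceptron-style update are all assumptions chosen to keep the example small. It shows only the structural point: fine-tuning updates the weights, while a hard-coded branch in the deployed graph is untouched.

```python
TRIGGER = 9.0      # hypothetical trigger value carried in the last feature
FORCED_LABEL = 0   # label the hidden branch always returns


def weights_only(weights, x):
    """Learned path: a linear score thresholded at zero."""
    score = sum(w * xi for w, xi in zip(weights, x))
    return 1 if score > 0 else 0


def deployed_graph(weights, x):
    """Deployed artifact: the learned path wrapped in a hard-coded branch."""
    if x[-1] == TRIGGER:   # graph-level check, independent of the weights
        return FORCED_LABEL
    return weights_only(weights, x)


def fine_tune(weights, data, lr=0.1, epochs=20):
    """Perceptron-style updates: only the weights change, never the branch."""
    w = list(weights)
    for _ in range(epochs):
        for x, y in data:
            err = y - weights_only(w, x)
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
    return w


clean_data = [([1.0, 0.0], 1), ([-1.0, 0.0], 0), ([2.0, 1.0], 1), ([-2.0, 1.0], 0)]
base = [0.05, 0.3]
tuned = fine_tune(base, clean_data)

triggered = [2.0, TRIGGER]
print("weights changed:", tuned != base)
print("clean accuracy ok:", all(deployed_graph(tuned, x) == y for x, y in clean_data))
print("learned path on triggered input:", weights_only(tuned, triggered))
print("deployed graph on triggered input:", deployed_graph(tuned, triggered))
```

In this toy, the tuned weights classify every clean sample correctly, yet the deployed graph still returns the forced label whenever the trigger value is present.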


Threat narrative

Attacker objective: The attacker wants a persistent, triggerable model bypass that survives deployment changes and enables misclassification at will.

  1. Entry occurs when a malicious or tampered model is introduced through the supply chain and accepted as a trusted artifact.
  2. Escalation happens when the backdoor is embedded in the computational graph, giving the attacker hidden control over model output when the trigger appears.
  3. Impact is realised when the deployed model misclassifies on demand while appearing normal during standard validation and later tuning cycles.

Read our 52 NHI Breaches Analysis report for a comprehensive view of breaches impacting Non-Human Identities including AI Agents.


NHI Mgmt Group analysis

Model supply chain integrity has become an identity control problem. Once a model artifact can carry hidden execution logic, approval is no longer just about provenance labels or checksum validation. The real question is who can alter the graph, who can attest to what is inside it, and whether downstream systems can detect hidden decision paths. Practitioners should treat model artifacts as governed assets with access boundaries, not inert files.

Persistent backdoors create an identity blast radius inside AI systems. A graph-level backdoor is not limited to one training run or one deployment target. It can survive format conversion and remain active after later tuning, which means the trust failure travels with the artifact. The implication for security teams is that model assurance must be continuous across the entire model lifecycle, not confined to pre-production review.

Model conversion pipelines need stronger provenance and custody controls. HiddenLayer's research shows that ONNX and TensorRT are not protections by themselves when the source model has already been altered. That means the control gap sits in handoff governance: artifact signing, tamper detection, and custody tracking between training, conversion, and deployment. Practitioners should re-evaluate whether their model pipeline can prove unchanged lineage end to end.
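
A minimal sketch of tamper detection at a handoff is shown below, using only Python's standard `hmac` and `hashlib` modules. The key and artifact bytes are placeholders, and the use of a shared-secret HMAC is a simplifying assumption: a production pipeline would more likely use asymmetric signatures so that verifiers cannot also sign.

```python
import hashlib
import hmac


def sign_artifact(key: bytes, artifact: bytes) -> str:
    """Produce an HMAC-SHA256 tag over the artifact bytes."""
    return hmac.new(key, artifact, hashlib.sha256).hexdigest()


def verify_artifact(key: bytes, artifact: bytes, signature: str) -> bool:
    """Recompute the tag and compare in constant time."""
    return hmac.compare_digest(sign_artifact(key, artifact), signature)


# Placeholder key and artifact bytes (assumptions for illustration only).
signing_key = b"example-key-held-by-the-release-stage"
approved_model = b"approved-model-bytes"

tag = sign_artifact(signing_key, approved_model)

print(verify_artifact(signing_key, approved_model, tag))
print(verify_artifact(signing_key, approved_model + b"-patched", tag))
```

A valid signature shows that the artifact is the one that was signed. It does not show that the signed artifact was free of hidden logic, so signing belongs after inspection, not instead of it.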

Security teams should stop treating ML backdoors as a pure data-science issue. The operational risk is broader because these models often sit inside systems that influence detection, classification, or access decisions. If a model can be coerced into the wrong answer on trigger, the downstream system inherits a compromised trust decision. The practical conclusion is that ML security, software supply chain security, and identity governance now overlap in the same control plane.

ShadowLogic illustrates a specific concept: the persistent graph backdoor. This is a backdoor embedded in model structure rather than learned only through weights, which makes it harder to remove through ordinary retraining. For practitioners, the important distinction is between a degraded model and a persistently tampered one. The latter requires artifact-level trust controls, not just model-performance monitoring.

From our research:

  • The average estimated time to remediate a leaked secret is 27 days, despite 75% of organisations expressing strong confidence in their secrets management capabilities, according to The State of Secrets in AppSec.
  • 43% of security professionals are concerned about AI systems learning and reproducing sensitive information patterns from codebases, according to The State of Secrets in AppSec.
  • For related analysis, see Top 10 NHI Issues for the controls most often weakened by artifact trust failures.

What this signals

Persistent graph backdoors turn model assurance into lifecycle governance. Once a model can preserve hidden behaviour through conversion and later tuning, the programme needs custody controls, not just validation checks. The operational lesson is that every model handoff becomes a trust boundary, which is why artifact signing and provenance tracking belong alongside deployment approvals.

With 43% of security professionals concerned that AI systems may learn and reproduce sensitive information patterns from codebases, the concern is no longer limited to prompt leakage or data exposure. The broader signal is that AI pipelines now need controls for what is absorbed, what is embedded, and what survives export. That is a programme design issue, not a model-team detail.

Teams that rely on external models should align review processes with OWASP Non-Human Identity Top 10 and NIST Cybersecurity Framework 2.0 thinking. The reason is simple: a model that can change decisions through hidden logic behaves like a privileged asset with an untrusted change history.


For practitioners

  • Inspect exported model graphs before approval: Review ONNX, TensorRT, and other deployed artifacts for unexpected conditional branches, output substitution nodes, and trigger-detection logic before they are promoted to production.
  • Require signed provenance across the model pipeline: Enforce artifact signing and custody tracking from training output through conversion and deployment so a tampered model cannot enter production unnoticed.
  • Treat fine-tuning as a performance step, not a cleanse step: Assume downstream tuning may improve accuracy while leaving hidden graph logic intact, so separate remediation from retraining in your workflow.
  • Add ML supply chain checks to governance reviews: Bring model artifact review into the same governance process used for privileged software artifacts, including approvals, exceptions, and change traceability.
  • Validate trigger resistance under adversarial testing: Test deployed models with controlled trigger patterns and compare behaviour before and after conversion so hidden branches are more likely to surface.
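
The trigger-testing item above can be sketched as a simple flip-rate measurement: stamp a candidate pattern onto otherwise ordinary inputs and count how often the prediction changes. Everything here is hypothetical, including the two stand-in models, the patch value, and the convention that the patch overwrites trailing features. It is a sketch of the testing idea, not a detection method taken from the source research.

```python
def stamp(x, patch):
    """Overwrite the trailing features of an input with a candidate patch."""
    return x[: len(x) - len(patch)] + list(patch)


def trigger_flip_rate(model, inputs, patch):
    """Fraction of inputs whose prediction changes once the patch is applied."""
    flips = sum(1 for x in inputs if model(x) != model(stamp(x, patch)))
    return flips / len(inputs)


def clean_model(x):
    """Stand-in benign model: decides on the first feature only."""
    return 1 if x[0] > 0 else 0


def backdoored_model(x):
    """Stand-in tampered model: same logic plus a hidden branch."""
    if x[-2:] == [7, 7]:
        return 0
    return clean_model(x)


inputs = [[1, 0, 0], [2, 5, 1], [-1, 3, 2], [3, 1, 1]]
candidate_patch = (7, 7)

print("clean model flip rate:", trigger_flip_rate(clean_model, inputs, candidate_patch))
print("tampered model flip rate:", trigger_flip_rate(backdoored_model, inputs, candidate_patch))
```

A test like this only helps when the candidate patterns happen to include the real trigger, which defenders rarely know in advance. A low flip rate is therefore weak evidence of safety, and the approach works best alongside graph inspection and provenance checks.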

Key takeaways

  • ShadowLogic shows that AI model backdoors can live in the graph itself, which makes them harder to remove than ordinary bad training data.
  • HiddenLayer's evidence shows that these backdoors survive format conversion and can persist after later fine-tuning, which expands the trust problem across the full model lifecycle.
  • Practitioners should respond by governing model provenance, artifact integrity, and conversion boundaries as security controls, not by assuming retraining will fix a compromised model.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

The OWASP Agentic AI Top 10 addresses the attack and risk surface, while NIST AI RMF and NIST CSF 2.0 set the governance and control requirements practitioners need to meet.

Framework | Control / Reference | Relevance
OWASP Agentic AI Top 10 | A2 | Hidden model logic can hijack agent-like decision paths in AI systems.
NIST AI RMF | GOVERN | Model custody and provenance are governance concerns, not just engineering checks.
NIST CSF 2.0 | PR.DS-6 | Model artifact integrity depends on protecting software and data from unauthorized modification.

Inspect agentic model artifacts for hidden branches and validate runtime behaviour before deployment.


Key terms

  • Shadow backdoor: A hidden malicious path inside a model or system that stays dormant until a trigger appears. In AI models, this means the artifact can behave normally during testing but produce attacker-chosen output under specific conditions, which makes ordinary validation incomplete.
  • Model provenance: The traceable history of where a model came from, who changed it, and how it moved into production. For security teams, provenance is the evidence chain that helps prove an artifact has not been altered in ways that undermine trust or decision integrity.
  • Computational graph: The structure that describes how data flows through a model from input to output. In a backdoor scenario, the graph matters because malicious logic can be inserted as control flow, allowing the model to preserve hidden behaviour even when weights or formats change.
  • Artifact custody: The controlled handling of a digital object across its lifecycle, including approval, conversion, storage, and deployment. In model security, custody defines who can alter the artifact and where integrity checks must exist so tampering cannot move silently downstream.

What's in the full report

HiddenLayer's full research covers the operational detail this post intentionally leaves for the source:

  • Side-by-side model graph examples showing how the backdoor logic is embedded in ONNX and TensorRT artifacts.
  • The full efficacy table comparing base, ShadowLogic, and fine-tuned backdoor performance across conversion steps.
  • Additional visual evidence of the trigger path and the model branches that preserve malicious output selection.
  • The conversion and retraining sequence used to test whether the backdoor survives downstream lifecycle changes.

👉 HiddenLayer's full post shows the model graphs, conversion results, and backdoor persistence tests in detail.

Deepen your knowledge

NHI governance, agentic AI identity, and machine identity security are core topics in our NHI Foundation Level course, the industry's only accredited NHI security programme. If you are responsible for identity security strategy or NHI governance in your organisation, it is worth exploring.
NHIMG Editorial Note
Published by the NHIMG editorial team on 2025-08-26.
NHI Mgmt Group — the independent authority on Non-Human Identity, IAM, and Agentic AI security. nhimg.org