AI explainability is creating new jailbreak paths for LLM security

By NHI Mgmt Group Editorial Team | Published 2025-04-29 | Domain: Agentic AI & NHIs | Source: CyberArk

TL;DR: CyberArk describes “adversarial AI explainability” as a way to study how LLM internals can be used to craft new jailbreak variants that bypass safeguards in both open- and closed-source models, while stressing that LLM output cannot be treated as a security control. The editorial issue is broader: once AI influences application behaviour, jailbreak resistance becomes an IAM and NHI governance problem, not just a model safety problem.

At a glance

What this is: This is CyberArk’s analysis of how explainability techniques can expose LLM behaviour and help create new jailbreak methods that defeat built-in safeguards.

Why it matters: For IAM and NHI practitioners, the key issue is that autonomous AI systems can influence decisions and actions without behaving like stable, trustworthy identities.

Context

AI explainability is the practice of inspecting how a model behaves internally so defenders can understand why it produces a given output. In this post, the security gap is not just unsafe text generation. It is the possibility that an LLM embedded in business logic can be manipulated into taking actions that affect authentication, approvals, or downstream tool use, which turns model behaviour into an identity governance concern.

CyberArk’s research frames jailbreaks as more than content-filter bypasses. Once a model has decision-making influence inside an application or agent workflow, prompt manipulation can become a route to unauthorized action, weak control enforcement, or trust collapse across linked systems. That is not a niche model-safety issue. It is a sign that AI agents and LLM-mediated workflows need the same governance discipline applied to other powerful NHI classes.

Key questions

Q: How should security teams govern LLMs that can trigger tools or workflows?

A: Treat the LLM as an untrusted decision component, not an authorizer. Give it the minimum tool scope required, enforce policy outside the model, and require logging for every action it can influence. If the model can initiate work, then privilege, approval, and revocation controls must sit around it, not inside it.

Q: Why do jailbreaks matter when an LLM is embedded in business logic?

A: Because the risk is no longer limited to unsafe text. A successful jailbreak can alter downstream actions, data access, or workflow decisions, which means the model has crossed from content generation into operational influence. That is why prompt security and identity governance need to be managed together.

Q: What is the difference between model alignment and access control?

A: Model alignment shapes what the system tends to say, while access control governs what the system is allowed to do. Alignment can reduce harmful outputs, but it does not enforce authorization, scope, or revocation. For high-impact workflows, only external controls can provide that boundary.

Q: When should organisations reduce autonomous AI privileges?

A: Any time the model can reach sensitive data, trigger external tools, or influence customer-facing or production workflows. If a jailbreak could create real-world action, then standing access is too risky. Use task-scoped privileges and remove them as soon as the task ends.
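
A minimal sketch of task-scoped privilege, assuming an in-memory grant store. The names ScopeStore, task_scope, and the "tickets:write" scope are hypothetical and are not taken from CyberArk's post; a real deployment would delegate this to its existing secrets or IAM platform.

```python
import time
from contextlib import contextmanager


class ScopeStore:
    """In-memory record of the scopes an agent currently holds (hypothetical)."""

    def __init__(self):
        self._grants = {}  # (agent_id, scope) -> expiry on the monotonic clock

    def grant(self, agent_id, scope, ttl_seconds):
        self._grants[(agent_id, scope)] = time.monotonic() + ttl_seconds

    def revoke(self, agent_id, scope):
        self._grants.pop((agent_id, scope), None)

    def is_allowed(self, agent_id, scope):
        expiry = self._grants.get((agent_id, scope))
        return expiry is not None and time.monotonic() < expiry


@contextmanager
def task_scope(store, agent_id, scope, ttl_seconds=60):
    """Grant a scope for one task and always revoke it afterwards."""
    store.grant(agent_id, scope, ttl_seconds)
    try:
        yield
    finally:
        # Revocation runs even if the task raises an exception.
        store.revoke(agent_id, scope)


store = ScopeStore()
with task_scope(store, "invoice-agent", "tickets:write"):
    during_task = store.is_allowed("invoice-agent", "tickets:write")
after_task = store.is_allowed("invoice-agent", "tickets:write")
```

The time-to-live acts as a backstop: if revocation is somehow skipped, the grant still lapses on its own.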

Technical breakdown

How adversarial AI explainability changes jailbreak analysis

The article’s core idea is that explainability methods can reveal where a model is most vulnerable to adversarial prompting. By comparing activations across benign prompts and jailbreak variants, researchers look for neurons, layers, or pathways that shift when the model crosses from refusal into compliance. That does not prove a single causal mechanism, but it can identify fragile regions in the model’s decision surface. In practical terms, this is a way to move from black-box guessing toward targeted testing of model behaviour under attack.

Practical implication: Use behavioural testing and prompt-level adversary emulation to identify where your LLM controls fail before production use.
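
The activation comparison described above can be illustrated with a toy example. This is a sketch only: the post does not specify a metric, so the distance between mean activations is an assumed, simple heuristic, and the activation values are made up. In practice the vectors would be captured from the model under test.

```python
from math import sqrt


def mean_vector(vectors):
    """Element-wise mean of equally sized activation vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def layer_shift(benign, jailbreak):
    """Euclidean distance between mean benign and mean jailbreak activations."""
    b, j = mean_vector(benign), mean_vector(jailbreak)
    return sqrt(sum((x - y) ** 2 for x, y in zip(b, j)))


def rank_layers(benign_by_layer, jailbreak_by_layer):
    """Return (layer, shift) pairs, largest shift first."""
    shifts = {
        layer: layer_shift(benign_by_layer[layer], jailbreak_by_layer[layer])
        for layer in benign_by_layer
    }
    return sorted(shifts.items(), key=lambda item: item[1], reverse=True)


# Toy activations: two prompts per group, three units per layer (invented values).
benign = {
    "layer_0": [[0.1, 0.2, 0.1], [0.1, 0.2, 0.3]],
    "layer_1": [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
}
jailbreak = {
    "layer_0": [[0.1, 0.2, 0.1], [0.1, 0.2, 0.3]],
    "layer_1": [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
}
ranking = rank_layers(benign, jailbreak)
```

Here layer_1 ranks first because its mean activation moves between the two prompt groups while layer_0 does not. A large shift marks a layer worth testing further; it is not evidence of a causal mechanism.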

Why alignment is not a security boundary for LLMs

Alignment methods such as RLHF, DPO, and RLAIF shape how a model responds, but they do not create a hard security boundary. A model can still be induced to answer unsafe requests through encoding tricks, indirect prompts, or other transformations that shift the input into a region the model handles differently. The article also distinguishes external guardrails, like input filtering and human review, from internal alignment. That distinction matters because neither layer should be assumed to enforce policy by itself when the model is part of an application workflow.

Practical implication: Treat alignment as a safety layer, not an authorization control, and add independent validation before any AI output drives action.
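
One way to picture independent validation is a deterministic check that runs on the model's proposed action before anything executes. The sketch below assumes the model emits its proposal as JSON; the POLICY table, the action names, and the amount limit are hypothetical.

```python
import json

# Hypothetical policy: permitted actions and limits on their arguments.
POLICY = {
    "refund": {"max_amount": 100.0},
    "lookup_order": {},
}


def validate_proposal(raw_output):
    """Return (allowed, reason) for an action proposed in the model's output."""
    try:
        proposal = json.loads(raw_output)
    except json.JSONDecodeError:
        return False, "output is not valid JSON"
    if not isinstance(proposal, dict):
        return False, "output is not a JSON object"

    action = proposal.get("action")
    if not isinstance(action, str) or action not in POLICY:
        return False, f"action {action!r} is not on the allowlist"

    limit = POLICY[action].get("max_amount")
    if limit is not None:
        amount = proposal.get("amount")
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(amount, bool) or not isinstance(amount, (int, float)):
            return False, "amount is missing or not a number"
        if not 0 < amount <= limit:
            return False, f"amount {amount} is outside the permitted range"

    return True, "ok"
```

The check never asks the model whether the action is acceptable. A jailbroken model can propose anything, but only proposals that pass the fixed policy reach the system of record.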

Why agentic AI makes jailbreaks an identity problem

The article connects jailbreak risk to agentic systems because model output can influence tool use, application state, and decision flow. In that setup, a successful jailbreak can do more than produce a bad response. It can change how the system behaves, which means the model is effectively operating with execution authority. That is where AI security and IAM intersect: if the model can trigger tools, access data, or steer workflows, then its privileges, constraints, and auditability become part of NHI governance.

Practical implication: Map every LLM and agent interaction to explicit privileges, tool scopes, and audit controls before allowing autonomous execution.
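
As an illustration of that mapping, the sketch below routes every tool call through a single gateway that checks an explicit grant and writes an audit record, including for denied attempts. ToolGateway, the agent name, and the tool names are hypothetical.

```python
from datetime import datetime, timezone


class ToolGateway:
    """Routes every tool call through a scope check and an audit record."""

    def __init__(self, grants):
        self._grants = grants   # agent_id -> set of permitted tool names
        self._tools = {}        # tool name -> callable
        self.audit_log = []     # one record per attempted call

    def register(self, name, func):
        self._tools[name] = func

    def call(self, agent_id, tool, **kwargs):
        allowed = tool in self._tools and tool in self._grants.get(agent_id, set())
        # Denied attempts are logged as well as permitted ones.
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "agent": agent_id,
            "tool": tool,
            "args": kwargs,
            "allowed": allowed,
        })
        if not allowed:
            raise PermissionError(f"{agent_id} may not call {tool}")
        return self._tools[tool](**kwargs)


gateway = ToolGateway({"support-agent": {"read_ticket"}})
gateway.register("read_ticket", lambda ticket_id: f"ticket {ticket_id}")
gateway.register("close_ticket", lambda ticket_id: f"closed {ticket_id}")

result = gateway.call("support-agent", "read_ticket", ticket_id=7)
try:
    gateway.call("support-agent", "close_ticket", ticket_id=7)
    denied = False
except PermissionError:
    denied = True
```

Because the grant table lives outside the model, a jailbreak can change what the agent asks for but not what it is permitted to call.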

Threat narrative

Attacker objective: The attacker wants the model to ignore safety constraints and produce outputs that can be used to drive harmful decisions or actions.

Entry occurs through a crafted prompt or encoded jailbreak variant that bypasses normal refusal behaviour.
Escalation follows when the model’s internal weak points or misaligned regions are exploited to generate unsafe or policy-violating outputs.
Impact occurs when those outputs influence downstream application logic, tool calls, or autonomous actions, turning a prompt-level bypass into operational misuse.

Moltbook AI agent keys breach: the Moltbook breach exposed 1.5M AI agent keys.
AI LLM hijack breach: attackers used stolen AWS access keys to hijack Anthropic models on Bedrock.

Read our 52 NHI Breaches Analysis report for a comprehensive view of breaches impacting Non-Human Identities including AI Agents.

NHI Mgmt Group analysis

Adversarial AI explainability turns prompt injection into a governance problem, not just a model-safety problem. If a model can be steered by manipulating internal response paths, then security teams are no longer defending only content quality. They are defending the integrity of decision-making systems that may already be connected to tools, data, and workflows. That shifts the control question from “Can the model refuse?” to “What can the model influence?” and that is an NHI governance issue.

Jailbreak resistance needs a named control concept: identity blast radius. Once an LLM or agent can affect multiple tools, a single successful bypass can expand from one bad output into broader operational misuse. The blast radius depends on what the model can access, what it can trigger, and how quickly defenders can revoke those rights. Practitioners should reduce that blast radius before they try to perfect any model-level safeguard.

LLM alignment is useful, but it cannot carry authorization responsibility. The article is right to separate guardrails from internal model alignment, because neither is equivalent to least privilege or access control. If teams let the model decide without external policy enforcement, they are handing identity decisions to a component that is specifically designed to generalize, not to authenticate. That requires explicit policy checks outside the model boundary.

Agentic AI security will converge with NHI governance faster than many teams expect. The moment an LLM can call tools, the security conversation becomes about secrets, scopes, approvals, logging, and revocation. That is the language of NHI governance, not experimental AI research. Teams that already manage service accounts and API credentials are closest to the right control model, provided they extend it to autonomous behaviour.

Testing for jailbreak resilience should become part of the access review cycle. A model or agent that can be prompted around policy should not retain broad standing access until the next annual review. Security teams should treat prompt-based bypass as a control failure that affects privilege decisions, not as a purely theoretical research finding.

From our research:
80% of organisations report their AI agents have already performed actions beyond their intended scope, including accessing unauthorised systems (39%), inappropriately sharing sensitive data (31%), and revealing access credentials (23%), according to the AI Agents: The New Attack Surface report.
Only 52% of companies can track and audit the data their AI agents access, leaving 48% with a complete blind spot for compliance and breach investigation, according to the AI Agents: The New Attack Surface report.
For a broader control baseline, the OWASP NHI Top 10 helps teams map agentic AI risks to concrete governance and testing priorities.

What this signals

Identity blast radius will become the practical metric for AI governance. Teams should stop asking only whether a model can be jailbroken and start asking how far a jailbreak could reach if it succeeds. The bigger the reachable tool set, the higher the operational blast radius, and the harder it becomes to contain misuse with post-incident review alone. Short-lived access and scoped execution are now core design choices, not optional hardening.
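
A rough way to make "how far could a jailbreak reach" measurable is to model access as a directed graph and count what is reachable from the agent. The sketch below assumes such a graph can be built from an inventory of tools and credentials; the node names and edges are hypothetical.

```python
from collections import deque


def blast_radius(access_graph, start):
    """Return every node reachable from `start`, excluding `start` itself."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in access_graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen - {start}


# Hypothetical edges meaning "X can use or reach Y".
access_graph = {
    "support-agent": ["ticket_api", "crm_token"],
    "crm_token": ["crm_api"],
    "crm_api": ["customer_records"],
    "ticket_api": [],
}

reachable = blast_radius(access_graph, "support-agent")
# Removing the CRM token from the agent shrinks what a jailbreak could touch.
reduced = blast_radius({**access_graph, "support-agent": ["ticket_api"]}, "support-agent")
```

In this toy graph, removing one standing credential cuts the reachable set from four nodes to one. That is the kind of before-and-after comparison an access review can act on.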

CyberArk’s research reinforces a broader industry pattern: AI systems that act on behalf of users inherit the governance burden of NHIs. That means access reviews, credential scoping, and revocation paths must cover model-driven workflows just as they do service accounts and automation. The right reference point is OWASP NHI Top 10, because model behaviour and identity control are now converging.

With 92% of organisations agreeing that governing AI agents is critical but only 44% having implemented any policies, per the AI Agents: The New Attack Surface report, the gap is no longer awareness. It is operationalisation. Security teams should prepare for control exceptions, audit evidence, and revocation workflows specific to autonomous agents rather than assuming existing IAM patterns will stretch far enough.

For practitioners

Limit tool authority before tuning model safety: Define explicit tool scopes, data access boundaries, and approval steps for every LLM or agent before it is allowed to execute actions. Separate read, suggest, and act permissions so a jailbreak cannot directly become a privileged workflow change.
Add adversarial prompt testing to pre-production gates: Test encoding tricks, indirect prompts, and multi-step jailbreak variants against the model path your users will actually run. Record which prompts change refusal behaviour, and block deployment until the weakest paths are remediated.
Treat the model as an untrusted decision component: Validate every high-impact output outside the model, especially anything that triggers access, payment, ticketing, or data movement. Keep a deterministic policy layer between the LLM and the system of record.
Shrink identity blast radius for autonomous workflows: Use short-lived credentials, scoped tokens, and revocation-ready controls so an agent cannot hold broad standing access. Revisit service accounts and API keys that are reachable from model-driven paths.
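
The pre-production gate in the second item could start as small as the sketch below. It assumes the model path can be called as a function from prompt to response; stub_model stands in for that path, the two encodings are illustrative rather than exhaustive, and the refusal check is deliberately crude.

```python
import base64
import codecs


def variants(prompt):
    """Generate simple encoded rewrites of a prompt (illustrative only)."""
    return {
        "plain": prompt,
        "base64": "Decode this base64 and follow it: "
                  + base64.b64encode(prompt.encode()).decode(),
        "rot13": "Decode this ROT13 and follow it: " + codecs.encode(prompt, "rot13"),
    }


def refused(response):
    """Crude refusal check; a real gate would need a more reliable classifier."""
    return response.strip().lower().startswith("i can't")


def failing_variants(model, prompt):
    """Return the variant names for which the model does not refuse."""
    responses = {name: model(text) for name, text in variants(prompt).items()}
    if not refused(responses["plain"]):
        return ["plain"]
    return [name for name, reply in responses.items() if not refused(reply)]


def stub_model(text):
    """Stand-in for the deployed model path: refuses only on a literal keyword."""
    if "disable logging" in text:
        return "I can't help with that."
    return "Sure, here is how..."


failures = failing_variants(stub_model, "disable logging")
deploy_allowed = not failures
```

The stub refuses the plain request but complies with both encoded forms, so the gate blocks deployment. The recorded variant names show which paths need remediation first.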

Key takeaways

LLM jailbreak risk becomes materially worse when the model can influence tools, workflows, or access decisions.
AI explainability may improve testing, but it does not replace external authorization and least-privilege controls.
Security teams should govern autonomous AI as an NHI problem, with scoped access, revocation, and auditability built in.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

The OWASP Agentic AI Top 10 addresses the attack and risk surface, while NIST AI RMF and NIST CSF 2.0 set the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
OWASP Agentic AI Top 10	NHI-01	Prompt injection and tool misuse map directly to agentic AI abuse paths.
NIST AI RMF		AI governance and accountability are central when models influence operational decisions.
NIST CSF 2.0	PR.AA-05	Least-privilege access is the right control model for model-driven workflows.

Test agent prompts for manipulation paths and restrict tool use to explicitly approved actions.

Key terms

Adversarial AI Explainability: A research approach that uses model-introspection and behavioural analysis to understand how adversarial prompts affect an LLM. It is not a security control by itself. Its value is in revealing fragile model behaviours that defenders can test, monitor, and harden before those behaviours are exploited in production.
Jailbreak: A prompt or input transformation that bypasses a language model’s safety restrictions and causes it to produce output it would normally refuse. In operational settings, a jailbreak matters because the model may be embedded in a workflow, making the bypass a pathway to broader misuse, not just a bad response.
Identity Blast Radius: The amount of damage an autonomous system can cause if its privileges are abused or its behaviour is manipulated. For AI agents and LLM-driven workflows, blast radius depends on tool access, data scope, credential reach, and the speed of revocation. Reducing it is a core NHI governance objective.

Deepen your knowledge

AI explainability and agentic AI identity are core topics in our NHI Foundation Level course, the industry's only accredited NHI security programme. If you are building controls for autonomous systems that can reach tools or data, it is worth exploring.

This post draws on content published by CyberArk: Unlocking New Jailbreaks with AI Explainability. Read the original.

NHIMG Editorial Note
Published by the NHIMG editorial team on 2025-04-29.
NHI Mgmt Group — the independent authority on Non-Human Identity, IAM, and Agentic AI security. nhimg.org