AI jailbreaks expose how foundation model safeguards fail

By NHI Mgmt Group Editorial TeamPublished 2026-01-14Domain: Agentic AI & NHIsSource: ZioSec

TL;DR: AI jailbreaks use creative prompts, roleplay, and other coercive techniques to bypass model safeguards and extract restricted output from systems such as ChatGPT, Claude, and Gemini, according to ZioSec. The practical lesson is that prompt filters alone do not close the governance gap when sensitive information can be elicited through indirect, adversarial instruction.

At a glance

What this is: This is an analysis of AI jailbreak techniques and the key finding that creative prompting can bypass foundation model safeguards to reveal sensitive information.

Why it matters: It matters because IAM and security teams now have to govern how AI systems disclose, not just who can authenticate, especially when agentic and human workflows share the same data boundary.

👉 Read ZioSec's analysis of AI jailbreak techniques and foundation model risks

Context

AI jailbreaks are prompts designed to bypass a model's built-in safety rules and coax it into producing restricted or sensitive content. In practice, that turns the model's response policy into a security boundary, which is a poor fit for environments that expect deterministic enforcement.

For identity and access teams, the real issue is not simply model misbehaviour but governance of what the model can reveal when it is connected to internal data, tools, or agent workflows. Once AI systems are part of access paths, prompt-based exfiltration becomes an identity and control problem, not only a content-safety problem.

The article's examples show that subtle prompt shaping can expose system prompts and other confidential details, which is typical of adversarial AI testing rather than an isolated curiosity. That makes the topic relevant to NHI, autonomous, and human identity programmes wherever model access intersects with secrets, delegated data, or privileged workflows.

Key questions

Q: How should security teams reduce the risk of AI jailbreaks in model-enabled workflows?

A: Security teams should treat jailbreak risk as a control-design problem, not only a prompt-filter problem. Minimise the sensitive data exposed to the model, separate generation from privileged execution, and place deterministic policy checks between the model and any tool, export, or retrieval action. If the model can see it, assume an attacker may try to make it say it.

Q: Why do AI jailbreaks matter for identity and access governance?

A: AI jailbreaks matter because models increasingly sit inside access paths to data and tools. When a prompt can elicit hidden instructions or sensitive output, the real failure is that identity governance is being enforced through language rather than entitlement. That creates an exposure gap for secrets, delegated access, and downstream automation.

Q: What do security teams get wrong about model safety filters?

A: Teams often assume safety filters stop harmful output in the same way an access control stops unauthorised access. They do not. Safety layers are probabilistic and context-sensitive, so indirect phrasing, roleplay, and creative framing can bypass them. The correct stance is layered control, not confidence in a single guardrail.

Q: How can organisations test whether a chatbot is leaking sensitive information?

A: Use controlled red-team prompts that try indirect extraction through stories, poems, translation, and multi-turn steering. Look for leaks of system prompts, policy text, hidden instructions, and confidential retrieval content. If the same model behaves differently under subtle framing, it is revealing a governance weakness that should be treated as a security defect.

Technical breakdown

How prompt-based jailbreaks bypass model safeguards

A jailbreak exploits the difference between what a model is trained to avoid and what it can still be persuaded to say. Models do not enforce security the way an access-control engine does. They generate the most likely next tokens, so an attacker can use indirection, roleplay, translation, or creative framing to shift the output into forbidden territory. The safety layer may block obvious requests, but it is still operating in a probabilistic environment where language manipulation can change the response path. That makes prompt design part of the attack surface, especially when model output is treated as trustworthy by downstream systems.

Practical implication: Treat model output as untrusted until validated against policy, data classification, and downstream control checks.

Why creative writing works as an attack path

Creative tasks are useful to attackers because they create semantic cover. A poem, story, or fictional chapter can embed instructions or requests that would be rejected if phrased directly. The model sees a legitimate genre request, but the hidden objective is to surface restricted information, system prompts, or policy exceptions. This is especially dangerous when the model has long context windows or has been exposed to system-level instructions that shape behaviour across turns. The attack does not break the model in a cryptographic sense. It exploits the fact that natural language is ambiguous and the safety boundary is interpretive, not absolute.

Practical implication: Limit what sensitive instructions and secrets are ever placed inside model context in the first place.

Agentic AI flows widen the blast radius of jailbreaks

When a model is embedded in an agentic workflow, jailbreaks are no longer just about bad answers. They can influence tool selection, data retrieval, and chained actions if the model's output is consumed by orchestrators or plugins. That is where a prompt-level bypass becomes a control-plane issue. If the workflow trusts the model to decide what to fetch, summarize, or reveal, then a successful jailbreak can alter action paths as well as language output. In that setting, the model becomes part of an identity and authorization chain, and the failure mode is broader than content leakage.

Practical implication: Separate generation, authorization, and execution so a compromised prompt cannot drive privileged actions.

ASP.NET machine keys RCE attack — 3,000+ exposed ASP.NET machine keys enabled remote code execution.
DeepSeek breach — DeepSeek breach exposed 1M+ log lines and sensitive secret keys.

Read our 52 NHI Breaches Analysis report for a comprehensive view of breaches impacting Non-Human Identities including AI Agents.

NHI Mgmt Group analysis

Prompt injection is not just a content-safety problem. It is a governance failure when models sit inside access paths. The article shows that indirect prompts can elicit system prompts and other confidential details, which means the control boundary is being enforced in language rather than in entitlement logic. That is a weak boundary for any environment where internal data, secrets, or workflow actions are reachable through the model. Practitioners should treat jailbreak resistance as part of access governance, not an isolated AI hygiene task.

Creative framing is a named attack pattern, not an edge case. The article's poetry and horror-novel examples demonstrate how attackers use benign-looking genres to hide malicious intent. That matters because policy filters are often tuned to obvious abuse, while indirect phrasing slips through the cracks. The field should stop assuming that direct harmful requests are the main risk; the more durable threat is semantic camouflage that reuses ordinary language to reach restricted outputs.

Access review assumptions do not map cleanly onto conversational AI. Access reviews were built for stable entitlements, known principals, and observable permission sets. A model session can produce different disclosures from the same context depending on prompt shape, conversation history, and hidden instructions. The implication is that governance must account for runtime behaviour, not just assigned access, because the effective exposure surface changes with each interaction.

Model prompt leakage creates an identity blast radius beyond the model itself. Once a system prompt, policy instruction, or embedded secret is exposed, an attacker can reuse that information to target adjacent workflows, other prompts, or downstream automations. That is why foundation model security cannot be separated from the surrounding identity fabric. The practitioner conclusion is straightforward: whatever the model can reveal can become a launch point into the rest of the environment.

Foundation model security inherits the weakest assumptions of the surrounding control plane. If a chatbot or agent is trusted to mediate access to internal information, then a jailbreak becomes an authorization bypass in practice even if it is not one in protocol terms. The model may still be compliant with its own rules while the larger workflow fails. Practitioners should evaluate the full chain from prompt to data exposure, not just the model in isolation.

From our research:
1 in 4 organisations are already investing in dedicated NHI security capabilities, with an additional 60% planning to do so within the next twelve months, according to The State of Non-Human Identity Security.
72% of organisations have experienced or suspect they have experienced a breach of non-human identities, including 46% confirmed and 26% suspected, according to The 2024 ESG Report: Managing Non-Human Identities.
For a broader breach lens, the 52 NHI Breaches Analysis helps teams connect prompt-driven leakage risks to the recurring failure patterns seen across machine identity incidents.

What this signals

Prompt leakage is becoming a governance signal, not just a model-quality issue. As foundation models move deeper into business workflows, the question shifts from whether a prompt is blocked to whether a coerced response can surface data that should never have been in reach. That is why teams should align model security reviews with access reviews, data classification, and privileged workflow mapping, using the NIST Cybersecurity Framework 2.0 as a broader governance scaffold.

Semantic camouflage is the named concept security programmes should track. Indirect, creative, or fictional framing can bypass obvious policy checks because it hides malicious intent inside ordinary language. Practitioners should build testing around prompt reformulation, context leakage, and hidden-instruction exposure, and they should compare those results with controls in the MITRE ATLAS adversarial AI threat matrix.

The operational signal to watch is whether the model can reveal more under subtle prompting than it should under direct prompting. If the answer is yes, the programme is depending on the model to self-police disclosure. That is not a stable control model for AI systems connected to secrets, documents, or tool execution, and it should trigger tighter context governance and agent boundary design.

For practitioners

Classify model context as sensitive data exposure surface Map which prompts, system instructions, and retrieved documents can reveal confidential information if a jailbreak succeeds. Remove secrets, policy text, and privileged operational detail from anything the model can retain or echo back. This is especially important where a chatbot sits in front of internal systems or shared knowledge stores.
Separate generation from execution authority Do not let model output directly trigger privileged actions, tool calls, or data exports. Use explicit authorization gates, deterministic policy checks, and scoped delegation so a coerced response cannot alter the execution path. Keep the model advisory when the output could affect access decisions.
Red-team indirect prompt paths, not only obvious abuse Test poetry, roleplay, fictional narration, translation, and multi-turn steering techniques against any model that handles internal content. Measure whether the system leaks hidden instructions, stored context, or policy fragments under subtle framing. That reveals the real attack surface faster than simple blocked-bad-word testing.
Monitor for prompt leakage signals in agentic workflows Log repeated context-reconstruction attempts, unusual prompt reformulation, and requests that try to elicit system-level instructions. When models are attached to tools, watch for attempts to use the chatbot as an indirect retrieval layer for restricted data. Connect those signals to incident response and access review.

Key takeaways

AI jailbreaks show that foundation model safeguards can be bypassed through indirect, creative prompting rather than obvious malicious requests.
The scale of the problem grows sharply when models are connected to internal data, tools, or agentic workflows because disclosure becomes an access issue.
Practitioners should respond by reducing sensitive context, separating generation from execution, and testing for semantic camouflage as a real attack path.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Agentic AI Top 10 address the attack and risk surface, while NIST AI RMF and NIST CSF 2.0 set the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
OWASP Agentic AI Top 10	A1	Jailbreaks exploit prompt control and unsafe model behavior in agentic settings.
NIST AI RMF		AI RMF governance applies to disclosure risk and runtime model behavior.
NIST CSF 2.0	PR.AC-4	Access control must cover downstream data exposure from model-mediated workflows.

Establish governance for model output, context handling, and incident response before deployment.

Key terms

AI Jailbreak: An AI jailbreak is a prompt technique used to bypass a model's built-in safety constraints and elicit restricted output. The attack works by changing the framing, context, or conversational path so the model reveals information it was intended to withhold.
Prompt Injection: Prompt injection is the act of inserting instructions that redirect a model away from its intended behavior. In practice, it can be direct or indirect, and it becomes especially risky when the model can access tools, documents, or internal instructions.
System Prompt: A system prompt is the hidden or high-priority instruction set that shapes a model's behavior across a conversation. If exposed, it can reveal policy logic, hidden constraints, or operational details that attackers can reuse to improve later bypass attempts.
Semantic Camouflage: Semantic camouflage is the use of ordinary-looking language to conceal malicious intent inside a prompt or instruction chain. It matters because models may treat the request as benign content generation while the attacker is actually steering toward disclosure or policy evasion.

Deepen your knowledge

AI jailbreaks and prompt leakage are core topics in our NHI Foundation Level course, the industry's only accredited NHI security programme. If you are governing model-enabled workflows or agentic access paths, it is worth exploring.

This post draws on content published by ZioSec: Exploring AI Jailbreaks: Techniques and Risks in Foundation Models. Read the original.

NHIMG Editorial Note
Published by the NHIMG editorial team on 2026-01-14.
NHI Mgmt Group — the independent authority on Non-Human Identity, IAM, and Agentic AI security. nhimg.org