TL;DR: Recent jailbreak techniques exploit policy simulation, tokenisation confusion, distraction, and time-based reasoning to bypass LLM safeguards and push unsafe output into agent workflows, according to Pillar Security's research. The security problem is not just a prompt-filtering failure: language and context are now being used as executable logic, which breaks assumptions built into conventional guardrails.
NHIMG editorial — based on content published by Pillar Security: Deep Dive Into The Latest Jailbreak Techniques We've Seen In The Wild
By the numbers:
- 80% of organisations report their AI agents have already performed actions beyond their intended scope, including accessing unauthorised systems, inappropriately sharing sensitive data, and revealing access credentials.
Questions worth separating out
Q: How should security teams prevent jailbreak prompts from reaching agent tools?
A: Security teams should separate model generation from execution authority.
Q: When do jailbreak techniques become a governance problem rather than a content problem?
A: They become a governance problem when the model can influence tools, credentials, or downstream systems.
Q: What do organisations get wrong about LLM safety filters?
A: Many organisations assume a filter that blocks obvious harmful wording is enough.
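The separation these answers describe can be sketched as a gate that sits between the model and its tools. This is a minimal illustration under stated assumptions, not Pillar Security's method: ToolCall, ALLOWED_TOOLS, authorize, and execute are hypothetical names, and the in-memory allowlist stands in for whatever policy engine an organisation actually runs.

```python
from dataclasses import dataclass, field

# Hypothetical per-agent allowlist; a real deployment would load this
# from a policy store rather than hard-coding it.
ALLOWED_TOOLS = {
    "support-bot": {"search_kb", "create_ticket"},
}


@dataclass(frozen=True)
class ToolCall:
    """A tool invocation proposed by the model. It carries no authority by itself."""
    agent_id: str
    tool: str
    args: dict = field(default_factory=dict)


def authorize(call: ToolCall) -> bool:
    """Decide outside the model whether a proposed call may run."""
    return call.tool in ALLOWED_TOOLS.get(call.agent_id, set())


def execute(call: ToolCall, registry: dict) -> str:
    """Run a tool only after the gate approves it."""
    if not authorize(call):
        return f"denied: {call.agent_id} may not call {call.tool}"
    return registry[call.tool](**call.args)


if __name__ == "__main__":
    # Illustrative tool registry with a single harmless tool.
    registry = {"search_kb": lambda query: f"results for {query}"}
    print(execute(ToolCall("support-bot", "search_kb", {"query": "vpn"}), registry))
    print(execute(ToolCall("support-bot", "read_secrets"), registry))
```

The design point is that a jailbroken model can still propose read_secrets, but the proposal is only text until a component the prompt cannot rewrite approves it.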
Practitioner guidance
- Separate content moderation from execution authority. Do not let a successful prompt pass from the model directly into tool use.
- Harden policy-like input paths. Treat XML, JSON, INI, screenshots, and copied policy snippets as untrusted when they arrive from users or external documents.
- Test classifiers with token distortion cases. Add adversarial examples that mutate trigger words, split harmful phrases, and preserve meaning while changing token boundaries.
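The token distortion guidance above could be turned into a small test-case generator. The sketch below is an assumption-laden illustration: the four mutations, the HOMOGLYPHS map, and the naive_filter stand-in are hypothetical choices rather than the payloads documented in Pillar Security's article, and "blockedterm" is a harmless placeholder for a real trigger word.

```python
# Hypothetical homoglyph map (Latin letters to Cyrillic look-alikes);
# extend it when building a real red-team test set.
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e"}
ZERO_WIDTH_SPACE = "\u200b"


def distort(phrase: str) -> dict:
    """Return variants that stay readable to a human but change token boundaries."""
    return {
        "spaced": " ".join(phrase),
        "zero_width": ZERO_WIDTH_SPACE.join(phrase),
        "hyphen_split": "-".join(phrase),
        "homoglyph": "".join(HOMOGLYPHS.get(ch, ch) for ch in phrase),
    }


def naive_filter(text: str, blocked: str) -> bool:
    """Stand-in for a substring-based filter; True means the text is blocked."""
    return blocked in text.lower()


def missed_variants(blocked: str) -> list:
    """Names of the distortions that the naive filter fails to catch."""
    return [
        name
        for name, variant in distort(blocked).items()
        if not naive_filter(variant, blocked)
    ]


if __name__ == "__main__":
    print(missed_variants("blockedterm"))
```

For the placeholder term, all four variants slip past the substring check while the undistorted term is still caught. In practice the same variants would be fed to the classifier under test in place of naive_filter.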
What's in the full article
Pillar Security's full research covers the operational detail this post intentionally leaves for the source:
- Step-by-step examples of policy puppetry, tokenisation confusion, distraction, and time-based jailbreak payloads.
- Annotated prompt structures showing how attackers hide malicious intent inside configuration-like text and multi-step tasks.
- Referenced papers and public disclosures for each jailbreak pattern, useful if you are building a red-team test set.
- Discussion of how these techniques can propagate into MCP-connected agent workflows and copilots.
👉 Read Pillar Security's analysis of the latest LLM jailbreak techniques →
LLM jailbreaks in agentic systems: what security teams should watch?
Explore further
Language has become executable logic in agentic systems. This article shows that attackers no longer need a software flaw when they can manipulate the meaning the model assigns to text. That is a governance problem because the control boundary moves from code execution to context interpretation. The implication is that identity and access teams must treat model output as a privileged decision surface, not just generated content.
A few things that frame the scale:
- 98% of companies plan to deploy even more AI agents within the next 12 months, despite documented rogue behaviour in 80% of current deployments, according to the "AI Agents: The New Attack Surface" report.
- Only 52% of companies can track and audit the data their AI agents access, leaving 48% with a complete blind spot for compliance and breach investigation.
A question worth separating out:
Q: What should teams do when a prompt can change an AI agent's behaviour?
A: Teams should treat the prompt as an input to a privileged runtime, not as harmless text. If a prompt can alter tool choice, task scope, or execution timing, then it needs the same kind of control thinking applied to secrets, access paths, and approval gates. That is the point where AI security becomes identity governance.
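One way to picture that control thinking is a check that compares what the agent was approved to do with what the model proposes after reading a prompt. The following is a rough sketch only: ApprovedTask, ProposedAction, and the three deviation labels are hypothetical and simply mirror the tool choice, task scope, and execution timing named in the answer above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ApprovedTask:
    """What a human or policy approved before the agent started (hypothetical schema)."""
    tools: frozenset
    resources: frozenset
    run_immediately: bool = True


@dataclass(frozen=True)
class ProposedAction:
    """What the model wants to do after reading the prompt and its context."""
    tool: str
    resource: str
    delay_seconds: int = 0


def deviations(task: ApprovedTask, action: ProposedAction) -> list:
    """List every way the proposed action departs from the approved task."""
    found = []
    if action.tool not in task.tools:
        found.append("tool choice")
    if action.resource not in task.resources:
        found.append("task scope")
    if task.run_immediately and action.delay_seconds > 0:
        found.append("execution timing")
    return found


def needs_approval(task: ApprovedTask, action: ProposedAction) -> bool:
    """Any deviation routes the action to an approval gate instead of running it."""
    return bool(deviations(task, action))


if __name__ == "__main__":
    task = ApprovedTask(frozenset({"read_file"}), frozenset({"reports/q1.csv"}))
    drifted = ProposedAction("send_email", "hr/salaries.csv", delay_seconds=3600)
    print(deviations(task, drifted))
```

The comparison runs on structured fields rather than on the prompt text, so it does not depend on recognising the jailbreak itself, only on noticing that the resulting action no longer matches what was approved.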
👉 Read our full editorial: Jailbreak techniques now target AI agents through context and logic