Adversarial poetry exposes fragile AI agent safety controls

By NHI Mgmt Group Editorial TeamPublished 2025-12-15Domain: Agentic AI & NHIsSource: ZioSec

TL;DR: New research on 25 language models found that poetic rewrites of harmful prompts raised jailbreak success from single digits to more than 40%, and in some cases above 60%, showing that style can defeat safety filters across proprietary and open-weight systems, according to ZioSec's summary of the Arxiv paper. The real risk is not poetry itself but the assumption that AI safety controls will generalize across form, especially once enterprise agents can act on bypassed output.

At a glance

What this is: This research shows that poetic prompt reformulation can materially increase jailbreak success across modern language models, exposing a structural weakness in AI safety filtering.

Why it matters: It matters because enterprise AI agents can turn a single stylistic bypass into real-world tool use, data exposure, or unsafe workflow execution across identity and governance programmes.

👉 Read ZioSec's analysis of adversarial poetry as a jailbreak vector

Context

Adversarial poetry is a stylistic jailbreak technique that changes the form of a prompt without changing its intent. The article argues that current AI safety controls still lean heavily on surface cues such as keywords, direct imperatives, and familiar harmful phrasing, which makes them brittle when the same request is disguised as verse or metaphor.

For identity and access teams, the issue is not simply content moderation. Once an AI agent can ingest a bypassed prompt, it may query internal systems, summarise sensitive data, or trigger downstream actions inside the enterprise control plane. That shifts the problem from model refusal rates to governance of agent behaviour, tool access, and containment boundaries.

The article's starting position is typical for enterprise AI deployments: models are being protected as if attackers will keep using obvious prose. The paper shows that assumption is already outdated.

Key questions

Q: How should security teams test AI agents for jailbreak resilience?

A: Security teams should test agents with both direct harmful prompts and stylistic variants that preserve intent while changing structure. The goal is to see whether the safety layer blocks only obvious phrasing or actually resists adversarial presentation. Include poetry, fiction, metaphor, and nested instructions in red-team suites, then verify that tool access stays constrained even if the model responds unsafely.

Q: Why do AI agents create more risk than chatbots when jailbreaks succeed?

A: AI agents can move from unsafe text to unsafe action because they are connected to tools, data sources, and workflows. A jailbreak that only changes the model's output is bad enough, but an agent can translate that output into retrieval, reporting, or execution. The result is a wider blast radius, especially when permissions are not tightly separated.

Q: What do teams get wrong about prompt injection and safety controls?

A: Teams often assume that if a model rejects direct harmful requests, it is safe enough. That misses the main weakness, which is that the same request can be disguised in ways the filter does not recognise. Effective controls have to account for intent, context, and structure, not just keyword matching or canonical attack templates.

Q: Who is accountable when a bypassed AI prompt triggers an enterprise action?

A: Accountability should sit with the team that authorized the agent's access and execution paths, not with the prompt alone. If the agent can query systems or trigger workflows, governance must define who owns those permissions, who reviews them, and which controls stop an unsafe response from becoming an operational event.

Technical breakdown

Why stylistic jailbreaks defeat safety filters

Modern alignment systems often score prompts using visible cues such as harmful keywords, imperative verbs, and known attack templates. Adversarial poetry preserves the underlying intent but changes syntax, rhythm, and framing, which can reduce the chance that the request is classified as unsafe. The model still understands the meaning, but the safety layer may not recognise the request as disallowed. That is why this is a form problem as much as a content problem. The weakness is especially visible in single-turn attacks, where there is no conversation history to recover context.

Practical implication: test safety controls against stylistic variants, not just canonical malicious prompts.

Why enterprise AI agents amplify the impact

A chatbot jailbreak is concerning; an agent jailbreak is operational. Enterprise agents do not just respond, they ingest email, documents, tickets, and internal data, then may call tools or trigger workflows. If a poisoned prompt survives ingestion, the model can act on it inside a broader execution chain. The article's warning is that a bypass does not stay at the model boundary. It can cascade into data access, report generation, or automation steps with business impact. That changes the control requirement from “reject bad text” to “contain untrusted instructions.”

Practical implication: isolate high-risk agent actions behind deterministic approval and policy gates.

Why form-shifting attacks belong in the agentic AI threat model

Form-shifting attacks are adversarial inputs that keep meaning stable while changing structure enough to evade filters. This is broader than poetry. Riddles, allegories, fictional framing, and nested narratives can all exploit the same weakness if the safety layer is too dependent on surface patterns. For agentic systems, that means prompt injection is not limited to obvious malicious text. The threat model has to include syntactic disguise, context smuggling, and multi-stage instruction embedding across documents and messages.

Practical implication: add adversarial red-team cases that vary structure, not just payload content.

Threat narrative

Attacker objective: The attacker wants to smuggle unsafe instructions past safety controls and use the agent's own permissions to produce harmful or unauthorized outcomes.

Entry occurs when a disguised prompt is delivered inside ordinary enterprise content such as an email, document, or ticket.
Escalation happens when the model interprets the poetic form as benign and bypasses its refusal behavior, allowing unsafe instructions to pass into the agent loop.
Impact follows when the agent acts on the compromised output, potentially querying internal systems or triggering downstream workflows with real operational consequences.

Moltbook AI agent keys breach — Moltbook breach exposed 1.5M AI agent keys.
AI LLM hijack breach — attackers used stolen AWS access keys to hijack Anthropic LLM models on Bedrock.

Read our 52 NHI Breaches Analysis report for a comprehensive view of breaches impacting Non-Human Identities including AI Agents.

NHI Mgmt Group analysis

Style-aware jailbreak resistance is now an identity governance problem, not just a model-safety problem. Once a prompt can be disguised well enough to pass the first inspection layer, the security question shifts to what the agent is allowed to reach after ingestion. That means governance has to cover the actor, its tools, and the data plane it can touch. The practitioner conclusion is that prompt screening alone is not a boundary.

Unsafe instruction filtering assumes the attacker will speak plainly. That assumption fails when the actor can reshape the same intent into verse, metaphor, or other non-canonical forms. The implication is that safety evaluations built around direct harmful phrasing overstate real protection, especially for enterprise agents embedded in workflows. Practitioners should treat stylistic variance as a core part of the threat model.

Adversarial poetry is a named example of the form-shift bypass class. The article demonstrates that the control gap is not limited to one quirky prompt style. It reveals a broader brittleness in systems that equate semantic understanding with policy compliance. For the field, the lesson is that AI safety controls need to be measured against adversarial presentation, not just adversarial intent.

Enterprise AI programmes must separate model correctness from operational trust. A model can produce accurate language and still be unsafe if the surrounding agent is authorized to act on it. That distinction matters across NHI, human, and autonomous governance because the execution layer determines the blast radius. The practitioner conclusion is to evaluate trust at the point of action, not only at the point of text generation.

OWASP Agentic AI Top 10 risk thinking fits this pattern because the weakness sits in prompt handling and tool misuse. The article reinforces that agentic systems inherit identity and privilege risk once they can be influenced through untrusted inputs. Security teams should treat stylized prompt bypasses as a governance signal, not an anomaly to be dismissed after the fact. The practitioner conclusion is to test the full control chain, not just the model card.

From our research:
96% of technology professionals identify AI agents as a growing security threat, and 66% believe this risk is immediate, according to AI Agents: The New Attack Surface report.
Only 52% of companies can track and audit the data their AI agents access, leaving 48% with a complete blind spot for compliance and breach investigation.
Read OWASP Agentic AI Top 10 for the control patterns that matter when prompt handling and tool use become part of the attack surface.

What this signals

Form-shifted jailbreaks should push teams to treat input style as a governance signal, not a curiosity. If your evaluation only covers direct harmful prose, you are measuring the wrong boundary. The next step is to align prompt testing with the OWASP Agentic AI Top 10 and test whether agent permissions stay safe when the model sees disguised instructions.

Adversarial prompt design is an attack class that will keep evolving faster than static filters. The lesson for practitioners is to move from text screening to containment. That means tighter tool authorization, better logging, and clearer separation between model output and execution rights.

With 80% of organisations reporting agent behaviour beyond intended scope in NHIMG research, the governance problem is already operational, not hypothetical. The evidence is strongest when you combine that finding with agent-level auditing and prompt classification across intake channels, because the weak point is often the handoff into the workflow rather than the model response itself.

For practitioners

Test against stylistic prompt variants Build red-team cases that restate the same harmful request as poetry, allegory, fiction, and nested narrative. Measure whether the safety layer still blocks the intent when the surface form changes.
Isolate high-risk agent actions Keep tool calls, data retrieval, and workflow execution behind deterministic policy checks so a bypassed prompt cannot directly trigger sensitive operations.
Add adversarial content to evaluation suites Include form-shifting prompts in routine testing, alongside direct jailbreaks, so model validation reflects how attackers actually disguise intent.
Review agent authorization boundaries Confirm that agents cannot reach systems, datasets, or delegated workflows that would turn a single unsafe response into a broader operational incident.

Key takeaways

Poetic rewrites can bypass modern AI safety filters because they preserve intent while changing the surface structure that defenders inspect.
Enterprise AI agents raise the stakes because a single bypassed prompt can flow into tool access, data retrieval, or workflow execution.
Security teams need adversarial testing, tighter execution boundaries, and governance that measures trust at the point of action, not just the point of text generation.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Agentic AI Top 10 address the attack and risk surface, while NIST AI RMF and NIST Zero Trust (SP 800-207) set the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
OWASP Agentic AI Top 10	A1	Covers prompt injection and tool misuse in agentic systems.
NIST AI RMF		Addresses governance and risk management for AI systems.
NIST Zero Trust (SP 800-207)	PR.AC-4	Agent actions should be constrained by continuous authorization and least privilege.

Define ownership, testing, and monitoring for AI behaviour that can affect enterprise actions.

Key terms

Adversarial Prompt: A prompt designed to alter model behaviour in a harmful or unintended way. In enterprise settings, it often hides the real request inside normal-looking text so filtering and review controls are less likely to catch it before the model acts.
Prompt Injection: A technique that embeds instructions inside content the model processes, such as emails, documents, or tickets, so the model follows attacker intent instead of intended policy. For agents, the concern is not only text generation but the downstream actions the model can trigger.
Agentic AI: An AI system that can decide what to do next, choose tools, and execute actions in a runtime loop. The governance challenge is that its behaviour can change during operation, which makes access control, monitoring, and containment more important than static prompt filtering alone.
Safety Filter: A control layer intended to block harmful or disallowed model outputs before they are returned or acted upon. These controls are useful but fragile when attackers change the form of a request without changing its meaning, which is why evaluation must include stylistic adversarial testing.

Deepen your knowledge

NHI governance, agentic AI identity, and machine identity security are core topics in our NHI Foundation Level course, the industry's only accredited NHI security programme. If you are responsible for identity security strategy or governance in your organisation, it is worth exploring.

This post draws on content published by ZioSec: Adversarial Poetry and the Hidden Fragility of AI Safety. Read the original.

NHIMG Editorial Note
Published by the NHIMG editorial team on 2025-12-15.
NHI Mgmt Group — the independent authority on Non-Human Identity, IAM, and Agentic AI security. nhimg.org