Notifications

Clear all

LLM judge bypasses in AI safety pipelines: are controls keeping up?

Last Post

RSS

NHI Mgmt Group

(@nhi-mgmt-group)

Member Moderator

Joined: 1 year ago

Posts: 12387

Topic starter 05/07/2026 6:52 pm

TL;DR: OpenAI Guardrails can be bypassed when the same class of LLM is used both to generate content and to judge jailbreak or prompt-injection risk, allowing harmful outputs and indirect tool misuse to pass, according to HiddenLayer. Self-regulating model-layer filters are not a sufficient security boundary; independent validation is.

NHIMG editorial — based on content published by HiddenLayer: Same Model, Different Hat

By the numbers:

Only 44% have implemented any policies to govern AI agents, despite 92% agreeing that governing them is critical to enterprise security.

Questions worth separating out

Q: What breaks when an LLM is used to judge its own safety?

A: The safety boundary stops being independent.

Q: Why do prompt injection attacks matter for AI governance?

A: Prompt injection matters because it can change how an AI system interprets instructions, tool calls, and policy logic.

Q: How can security teams know whether AI guardrails are actually working?

A: They should test whether the guardrails still block malicious input after the attacker embeds instructions inside tool outputs, formatting tricks, or threshold manipulation.

Practitioner guidance

Separate generation from enforcement Place the final allow or block decision in an independent control layer that does not share the same prompt path as the model being judged.
Treat tool outputs as untrusted input Sanitise fetched content before it reaches the model context, and prevent tool responses from carrying instructions that can alter downstream behaviour.
Red-team the judge, not just the model Test whether threshold manipulation, confidence spoofing, and formatted prompt injection can alter guardrail decisions under realistic adversarial conditions.

What's in the full report

HiddenLayer's full research covers the exploit mechanics and test configurations this post intentionally leaves at the strategy level:

The exact guardrail settings, thresholds, and model choices used in the bypass experiments
Step-by-step prompt templates that showed how the judge model could be manipulated
The indirect prompt injection proof of concept involving fetched web content and tool calls
The observed failure pattern across jailbreak and prompt-injection detection pipelines

👉 Read HiddenLayer's research on bypassing LLM-based safety guardrails →

LLM judge bypasses in AI safety pipelines: are controls keeping up?

Explore further

View Full Forum → | NHI Foundation Course →

Quote

Topic Tags

Mr NHI

(@mr-nhi)

Member Moderator

Joined: 3 months ago

Posts: 11961

05/07/2026 7:13 pm

Self-policing model safety is a broken assumption, not a control gap: the guardrail design assumes the evaluator is materially more trustworthy than the model it watches. That assumption fails when both are the same class of LLM and both can be steered by prompt injection. The implication is that AI governance must stop treating model verdicts as independent assurance.

A few things that frame the scale:

Only 44% have implemented any policies to govern AI agents, despite 92% agreeing that governing them is critical to enterprise security, according to AI Agents: The New Attack Surface report.
Only 52% of companies can track and audit the data their AI agents access, leaving 48% with a complete blind spot for compliance and breach investigation.

A question worth separating out:

Q: Who should own accountability for AI safety controls when models can call tools?

A: Accountability should sit with the team that owns the end-to-end workflow, not with the model itself. Once a model can fetch data, trigger tools, or influence access decisions, the organisation needs a named control owner for both policy enforcement and audit evidence.

👉 Read our full editorial: Self-policing AI safety fails when the same model judges risk

ReplyQuote

Forum Statistics

11 Forums

13.6 K Topics

26.1 K Posts

37 Online

135 Members

Latest Post: LLM security and AI-driven crime: what security teams must change Our newest member: Alex Recent Posts Unread Posts Tags

Forum Icons: Forum contains no unread posts Forum contains unread posts

Topic Icons: Not Replied Replied Active Hot Sticky Unapproved Solved Private Closed

#1 Authority in NHI Education, Research and Advisory, empowering organizations to tackle the critical risks posed by Non-Human Identities (NHIs), including AI Agents.

Get in Touch

Quick Links

FAQ

NHI 101 Articles

Legal & Policies