TL;DR: OpenAI Guardrails can be bypassed when the same class of LLM is used both to generate content and to judge jailbreak or prompt-injection risk, allowing harmful outputs and indirect tool misuse to pass, according to HiddenLayer. Self-regulating model-layer filters are not a sufficient security boundary; independent validation is.
NHIMG editorial — based on content published by HiddenLayer: Same Model, Different Hat
By the numbers:
- Only 44% have implemented any policies to govern AI agents, despite 92% agreeing that governing them is critical to enterprise security.
Questions worth separating out
Q: What breaks when an LLM is used to judge its own safety?
A: The safety boundary stops being independent.
Q: Why do prompt injection attacks matter for AI governance?
A: Prompt injection matters because it can change how an AI system interprets instructions, tool calls, and policy logic.
Q: How can security teams know whether AI guardrails are actually working?
A: They should test whether the guardrails still block malicious input after the attacker embeds instructions inside tool outputs, formatting tricks, or threshold manipulation.
Practitioner guidance
- Separate generation from enforcement Place the final allow or block decision in an independent control layer that does not share the same prompt path as the model being judged.
- Treat tool outputs as untrusted input Sanitise fetched content before it reaches the model context, and prevent tool responses from carrying instructions that can alter downstream behaviour.
- Red-team the judge, not just the model Test whether threshold manipulation, confidence spoofing, and formatted prompt injection can alter guardrail decisions under realistic adversarial conditions.
What's in the full report
HiddenLayer's full research covers the exploit mechanics and test configurations this post intentionally leaves at the strategy level:
- The exact guardrail settings, thresholds, and model choices used in the bypass experiments
- Step-by-step prompt templates that showed how the judge model could be manipulated
- The indirect prompt injection proof of concept involving fetched web content and tool calls
- The observed failure pattern across jailbreak and prompt-injection detection pipelines
👉 Read HiddenLayer's research on bypassing LLM-based safety guardrails →
LLM judge bypasses in AI safety pipelines: are controls keeping up?
Explore further
Self-policing model safety is a broken assumption, not a control gap: the guardrail design assumes the evaluator is materially more trustworthy than the model it watches. That assumption fails when both are the same class of LLM and both can be steered by prompt injection. The implication is that AI governance must stop treating model verdicts as independent assurance.
A few things that frame the scale:
- Only 44% have implemented any policies to govern AI agents, despite 92% agreeing that governing them is critical to enterprise security, according to AI Agents: The New Attack Surface report.
- Only 52% of companies can track and audit the data their AI agents access, leaving 48% with a complete blind spot for compliance and breach investigation.
A question worth separating out:
Q: Who should own accountability for AI safety controls when models can call tools?
A: Accountability should sit with the team that owns the end-to-end workflow, not with the model itself. Once a model can fetch data, trigger tools, or influence access decisions, the organisation needs a named control owner for both policy enforcement and audit evidence.
👉 Read our full editorial: Self-policing AI safety fails when the same model judges risk