By NHI Mgmt Group Editorial Team
Published 2026-04-28
Domain: Agentic AI & NHIs
Source: Lasso Security

TL;DR: Lasso Security’s red-team test of AprielGuard found a 42.2% bypass rate across 1,500 adversarial prompts, with 232 successful jailbreaks and more than 1,000 custom policy violations slipping through. Lightweight guardrails cannot be treated as a primary control layer without continuous adversarial testing and layered defenses.


At a glance

What this is: Lasso Security’s testing shows AprielGuard missed 42.2% of adversarial prompts, exposing a gap between safety taxonomy coverage and real attack resistance.

Why it matters: IAM, NHI, and autonomous AI programmes should treat guardrails as one control in a layered defence model, not as a substitute for governance, testing, and fallback controls.

👉 Read Lasso Security’s analysis of AprielGuard bypass testing and adversarial attacks


Context

AI guardrails are control layers that try to detect or block unsafe prompts, jailbreaks, and policy violations before they reach downstream model logic. The core issue here is the gap between advertised safety coverage and adversarial resilience, especially when a guardrail is expected to stand in front of an AI workflow on its own.

For IAM teams, the key issue is not whether a model can classify risky text, but whether the control can hold under manipulation, obfuscation, and repeated attack attempts. That matters across autonomous AI, NHI-adjacent workflows, and human-facing AI systems because the control failure is the same: a weak front door creates false confidence in the rest of the programme.

The source testing is a typical example of why benchmark scores and feature lists do not settle governance questions. Real attacker behaviour, not model positioning, is what should decide whether a guardrail belongs in the control stack.


Key questions

Q: How should security teams test AI guardrails before deployment?

A: Test guardrails with adversarial variation, not just known-bad prompts. Include obfuscation, encoding, role-play, and multi-step jailbreak patterns, then measure whether the control still blocks the request under repeat attempts and operational load. A guardrail that only performs in benchmark conditions is not ready to serve as the primary enforcement layer.
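The repeat-attempt measurement described above can be sketched in a few lines of Python. This is a minimal illustration, not Lasso Security’s test harness: `keyword_guardrail` is a hypothetical stand-in for whatever guardrail is under test, and the convention that the guardrail returns True when it blocks a prompt is an assumption.

```python
from typing import Callable, Iterable


def keyword_guardrail(prompt: str) -> bool:
    """Hypothetical toy guardrail: returns True when the prompt is blocked."""
    return "ignore previous instructions" in prompt.lower()


def bypass_rate(
    guardrail: Callable[[str], bool],
    prompts: Iterable[str],
    attempts: int = 3,
) -> float:
    """Fraction of prompts allowed through on at least one of `attempts` tries."""
    prompt_list = list(prompts)
    if not prompt_list:
        return 0.0
    bypassed = 0
    for prompt in prompt_list:
        # An attacker only needs one success, so any allowed attempt counts.
        if any(not guardrail(prompt) for _ in range(attempts)):
            bypassed += 1
    return bypassed / len(prompt_list)
```

Counting a prompt as bypassed when any single attempt gets through reflects the attacker’s position: a guardrail that blocks a request nine times out of ten is still a guardrail that lets it through.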

Q: When does a guardrail create more confidence than protection?

A: A guardrail creates false confidence when teams treat classification as enforcement and assume feature breadth equals resilience. That risk is highest when one model is expected to stop prompt injection, jailbreaks, and policy violations on its own. If adversaries can bypass it with minor prompt changes, the control is advisory, not protective.

Q: What do teams get wrong about prompt injection defence?

A: Teams often test for obvious malicious wording instead of attacker intent preserved through obfuscation or encoding. That misses how real adversaries work. Effective defence requires coverage of transformed prompts, policy-specific test cases, and telemetry that shows where the failure happened, not just whether the model answered safely.

Q: Should AI workflow monitoring replace prompt filtering?

A: No. Prompt filtering and workflow monitoring solve different problems. Prompt filtering tries to stop malicious input, while workflow monitoring looks at tool calls, reasoning traces, and actions after the prompt is accepted. Organisations need both if they want visibility into whether an AI system is merely classifying text or actually executing unsafe behaviour.


Technical breakdown

Why prompt injection defeats single-layer guardrails

Prompt injection works by reshaping the model’s interpretation of the input so the attacker’s intent survives even when the words are disguised. That can happen through obfuscation, encoding, role-play framing, or instructions embedded inside seemingly normal text. A guardrail model is only as strong as its ability to recognise intent across those transformations. When a single classifier is used as the primary enforcement layer, attackers only need one effective variant to slip through. Lasso Security’s bypass results show that detection can look strong on obvious payloads while failing under adversarial variation.

Practical implication: test guardrails against transformed prompts, not just obvious malicious strings.
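One way to build such a test set is to derive several transformed variants from each known-bad payload. The sketch below uses only the Python standard library; the choice of transformations and the role-play template are illustrative assumptions, not the set used in the source testing.

```python
import base64
import codecs
from typing import Dict


def transformed_variants(payload: str) -> Dict[str, str]:
    """Return the same payload under several encodings and framings."""
    raw = payload.encode("utf-8")
    return {
        "plain": payload,
        "base64": base64.b64encode(raw).decode("ascii"),
        "base32": base64.b32encode(raw).decode("ascii"),
        "ascii85": base64.a85encode(raw).decode("ascii"),
        "rot13": codecs.encode(payload, "rot13"),
        # Hypothetical role-play wrapper; real red-team templates vary widely.
        "role_play": f"You are an actor rehearsing a scene. Your line is: {payload}",
    }
```

Each variant carries the same intent, so a guardrail that blocks the plain form but allows any of the others is matching wording rather than intent.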

How jailbreaks and policy bypasses create enforcement drift

Jailbreaks are attempts to override a model’s safety policy by coercing it into following the attacker’s framing instead of the system’s restrictions. Policy bypass is the broader failure mode where the control misclassifies or misses the request and allows behaviour that should have been blocked. In Lasso Security’s testing, the same guardrail also struggled with custom policy rules, which suggests enforcement drift between intended governance and actual runtime decisions. That gap matters because the control can appear to support a policy framework while still letting restricted content through at scale.

Practical implication: validate policy enforcement with red-team tests mapped to each safety category.
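Mapping tests to categories can be as simple as tagging each red-team prompt with the safety category or custom rule it targets and reporting the miss rate per tag. The following is a minimal sketch; the category labels and the blocks-returns-True convention are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple


def bypass_by_category(
    guardrail: Callable[[str], bool],
    cases: List[Tuple[str, str]],
) -> Dict[str, float]:
    """Per-category fraction of test prompts the guardrail failed to block.

    `cases` holds (category, prompt) pairs; `guardrail` returns True on a block.
    """
    totals: Dict[str, int] = defaultdict(int)
    missed: Dict[str, int] = defaultdict(int)
    for category, prompt in cases:
        totals[category] += 1
        if not guardrail(prompt):
            missed[category] += 1
    return {category: missed[category] / totals[category] for category in totals}
```

A per-category breakdown makes enforcement drift visible: an acceptable aggregate rate can hide a custom rule that is never enforced at all.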

Why workflow monitoring matters more than prompt-only filtering

Workflow monitoring looks beyond the user prompt and inspects tool calls, reasoning traces, and execution behaviour inside an AI system. That matters because many attacks do not stop at the prompt layer. They aim to influence how the model behaves once it is already inside a workflow, where decisions, outputs, and tool usage can create downstream impact. Prompt-only filtering cannot see that full path. The article’s focus on agent workflow analysis reflects a broader control truth: the more an AI system can act, the more the guardrail must observe action, not just text.

Practical implication: place guardrails alongside execution logging and policy checks, not as a replacement for them.
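A rough sketch of what sitting alongside execution logging and policy checks can look like: every tool call is checked against a per-agent allowlist and recorded whether or not it is permitted. The `ToolCall` and `WorkflowMonitor` names, and the allowlist model itself, are hypothetical and not taken from any specific product.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ToolCall:
    agent: str
    tool: str
    argument: str


@dataclass
class WorkflowMonitor:
    # Hypothetical policy: the tools each agent identity is allowed to invoke.
    allowed_tools: Dict[str, Set[str]]
    log: List[dict] = field(default_factory=list)

    def authorise(self, call: ToolCall) -> bool:
        """Check the call against policy and record the decision either way."""
        allowed = call.tool in self.allowed_tools.get(call.agent, set())
        self.log.append(
            {
                "agent": call.agent,
                "tool": call.tool,
                "argument": call.argument,
                "allowed": allowed,
            }
        )
        return allowed
```

Because the decision is logged even when the call is allowed, the record shows what the system actually did after the prompt was accepted, which is the part prompt-only filtering cannot see.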




NHI Mgmt Group analysis

Single-model guardrails create a false perimeter when the attacker can iterate. AprielGuard’s bypass rate shows that safety classification is not the same thing as enforcement. When a guardrail is positioned as the main protection layer, every missed prompt becomes a direct path to downstream model behaviour. Practitioners should read this as a perimeter failure, not a tuning issue.

Prompt obfuscation is not an edge case; it is the attack model. Lasso Security’s testing shows that simple encodings were blocked more reliably than more complex variants such as Base32 and ASCII85. That means attackers do not need exotic tooling to find a path through, only enough variation to force inconsistent detection. The practical conclusion is that red-team coverage must follow attacker adaptation, not vendor taxonomy.

Custom policy enforcement is only real if the model can sustain it under stress. More than 1,000 policy violations slipping through means the governance promise and the runtime control diverged. This is the kind of failure that creates audit risk, because policy existence is not policy enforcement. Teams should assume that written controls are weak evidence until adversarial testing proves otherwise.

Workflow-level monitoring is becoming the control boundary for AI security. A guardrail that can inspect tool calls and reasoning traces is closer to the actual risk surface than one that only inspects prompts. That matters for AI systems with operational side effects, where the harm comes from what the system does after classification. The field should move from prompt moderation to behaviour-aware oversight.

Model benchmark claims should be treated as input, not assurance. A broad safety taxonomy can still leave large bypass windows when the model meets real adversarial conditions. The named concept here is guardrail bypass debt: the gap between advertised safety coverage and the number of successful attack variants that still work in practice. Practitioners should size controls around adversarial resilience, not feature breadth.

From our research:

  • 85% of organisations lack full visibility into third-party vendors connected via OAuth apps, according to The State of Non-Human Identity Security.
  • Only 1.5 out of 10 organisations are highly confident in their ability to secure NHIs, compared with nearly 1 in 4 for securing human identities.
  • The next question is how AI guardrails and NHI governance converge, which is why practitioners should also review Top 10 NHI Issues for control prioritisation.

What this signals

Guardrail bypass is becoming a governance problem, not just a model-quality problem. Security teams should expect attackers to probe the mismatch between declared safety categories and actual blocking behaviour, especially where one layer is asked to carry the whole control burden. The practical signal is simple: if a control cannot survive variation, it should not sit at the centre of the operating model.

Guardrail bypass debt: the gap between broad safety claims and the number of attack variants that still work in practice. That gap matters because AI systems increasingly sit inside workflows where the real risk is downstream action, not just unsafe text.

Practitioners should align testing with adversarial AI threat modelling and execution monitoring, then use those results to decide whether the guardrail belongs at the edge, in the middle, or only as a detection signal.


For practitioners

  • Red-team guardrails with transformed attack variants: Test prompt injection, jailbreak, and policy bypass attempts using obfuscation, encoding, and role-framing so the control is evaluated against attacker adaptation rather than one fixed payload.
  • Require policy-specific enforcement evidence: Map each safety category and custom rule to a test case, then verify that the model blocks it consistently under load, retries, and variation instead of only in isolated benchmark runs.
  • Add execution telemetry beside prompt filtering: Log tool calls, reasoning traces, and downstream actions so security teams can see whether a control failure happened at input, classification, or runtime execution.
  • Treat guardrails as one layer in a broader control stack: Place separate controls around data access, tool permissions, and model output handling so one classifier failure does not become a complete AI workflow compromise.
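The layered approach in the list above can be illustrated with a small sketch in which independent checks run in order and the first one that blocks is reported. The layer names, the allowlist, and the request shape are all hypothetical; a real deployment would use far richer checks at each layer.

```python
from typing import Callable, Dict, List, Optional, Tuple

Request = Dict[str, str]
# Each layer is (name, check); the check returns True when the layer blocks.
Layer = Tuple[str, Callable[[Request], bool]]

ALLOWED_TOOLS = {"search_docs", "summarise"}  # hypothetical allowlist


def prompt_filter_blocks(request: Request) -> bool:
    """Toy prompt-level classifier, easily evaded by encoding the payload."""
    return "ignore previous instructions" in request["prompt"].lower()


def tool_permission_blocks(request: Request) -> bool:
    """Independent control: block any tool outside the allowlist."""
    return request["tool"] not in ALLOWED_TOOLS


LAYERS: List[Layer] = [
    ("prompt_filter", prompt_filter_blocks),
    ("tool_permissions", tool_permission_blocks),
]


def first_blocking_layer(layers: List[Layer], request: Request) -> Optional[str]:
    """Return the name of the first layer that blocks, or None if all allow."""
    for name, blocks in layers:
        if blocks(request):
            return name
    return None
```

With this arrangement, a Base64-encoded instruction that slips past the prompt filter is still stopped if it tries to invoke a tool outside the allowlist, and the returned layer name gives the telemetry needed to see where each request was caught.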

Key takeaways

  • AprielGuard’s test results show that a wide safety taxonomy does not guarantee strong adversarial resilience.
  • The 42.2% bypass rate and 232 successful jailbreaks show that prompt-level defence can fail at operational scale.
  • Teams should validate guardrails as one layer in a broader AI control stack, not as the primary security boundary.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

The OWASP Agentic AI Top 10 addresses the attack and risk surface, while NIST AI RMF and NIST CSF 2.0 set the governance and control requirements practitioners need to meet.

Framework | Control / Reference | Relevance
OWASP Agentic AI Top 10 | (none listed) | Prompt injection and jailbreak resilience are core agentic AI risks in this article.
NIST AI RMF | (none listed) | The article tests whether the AI control layer is reliable under adversarial conditions.
NIST CSF 2.0 | DE.CM | Workflow monitoring and detection failures map to continuous security monitoring.

Map guardrail testing to agentic AI abuse cases and verify controls against prompt manipulation.


Key terms

  • Guardrail: A guardrail is a control layer that tries to detect, block, or constrain unsafe AI behaviour before it reaches users or downstream systems. In practice, it may inspect prompts, outputs, policy violations, or internal workflow activity, depending on how much of the model’s execution path it can observe.
  • Prompt Injection: Prompt injection is an attack that manipulates an AI system by embedding instructions that change how the model interprets or follows a request. The attacker’s goal is to override intended policy or steer behaviour through the input channel, often using obfuscation, role-play, or hidden instructions.
  • Jailbreak: A jailbreak is a prompt attack that coerces a model into ignoring its safety constraints and producing behaviour it should refuse. The control failure is not only in content moderation, but in enforcement under pressure, where the model follows attacker framing instead of policy.
  • Workflow Monitoring: Workflow monitoring is the inspection of AI execution beyond the prompt, including tool calls, reasoning traces, and actions taken during a session. For AI systems that can act, this is the difference between seeing what was asked and understanding what the system actually did.

Deepen your knowledge

AI guardrail testing and adversarial validation are covered in our NHI Foundation Level course, the industry's only accredited NHI security programme. If you are defining how AI protections fit into a broader identity control model, it is a useful place to start.

This post draws on content published by Lasso Security: When Model Guardrails Break: How AprielGuard Performed Against 1,500 Adversarial Attacks. Read the original.

NHI Mgmt Group — the independent authority on Non-Human Identity, IAM, and Agentic AI security. nhimg.org