TL;DR: A 42.2% bypass rate across 1,500 adversarial prompts was found in a red-team test of AprielGuard, with 232 successful jailbreaks and more than 1,000 custom policy violations slipping through, according to Lasso Security. Lightweight guardrails cannot be treated as a primary control layer without continuous adversarial testing and layered defenses.
NHIMG editorial — based on content published by Lasso Security: When Model Guardrails Break: How AprielGuard Performed Against 1,500 Adversarial Attacks
By the numbers:
- Lasso Security found a 42.2% bypass attack rate across 1,500 adversarial prompts.
- The team observed 232 jailbreak prompts that successfully bypassed the guardrail.
Questions worth separating out
Q: How should security teams test AI guardrails before deployment?
A: Test guardrails with adversarial variation, not just known-bad prompts.
Q: When does a guardrail create more confidence than protection?
A: A guardrail creates false confidence when teams treat classification as enforcement and assume feature breadth equals resilience.
Q: What do teams get wrong about prompt injection defence?
A: Teams often test for obvious malicious wording instead of attacker intent preserved through obfuscation or encoding.
Practitioner guidance
- Red-team guardrails with transformed attack variants Test prompt injection, jailbreak, and policy bypass attempts using obfuscation, encoding, and role-framing so the control is evaluated against attacker adaptation rather than one fixed payload.
- Require policy-specific enforcement evidence Map each safety category and custom rule to a test case, then verify that the model blocks it consistently under load, retries, and variation instead of only in isolated benchmark runs.
- Add execution telemetry beside prompt filtering Log tool calls, reasoning traces, and downstream actions so security teams can see whether a control failure happened at input, classification, or runtime execution.
What's in the full report
Lasso Security’s full research covers the operational detail this post intentionally leaves for the source:
- The test setup used to isolate AprielGuard in a minimal inference environment and expose it over HTTP
- The top detection techniques and which obfuscation methods performed best or worst against the model
- The category-by-category bypass analysis, including hate-related prompts, sexual content, and custom policy rules
- The red-team methodology used to generate adversarial prompt variants at scale
👉 Read Lasso Security’s analysis of AprielGuard bypass testing and adversarial attacks →
AprielGuard bypass rates: are AI guardrails keeping up?
Explore further