AI agent safety testing exposes the limits of enterprise auth

By NHI Mgmt Group Editorial Team | Published 2025-11-07 | Domain: Agentic AI & NHIs | Source: WorkOS

TL;DR: Haize Labs’ automated red-teaming platform targets prompt injection, goal misalignment, hallucination, and other behavioural failures in LLMs and AI agents. The company reports a $100M post-money valuation, and its Cascade system is reported to generate attacks 38x faster while using 4x less GPU memory. The real lesson is that safety testing and enterprise authentication solve different problems: one checks behaviour, the other governs access.

At a glance

What this is: Haize Labs focuses on automated AI safety testing for LLMs and agents, showing that behavioural risk now needs testing at scale rather than only human review.

Why it matters: IAM teams need to separate model safety from identity control, because authenticated access does not prevent unsafe agent behaviour once a user or system is inside the boundary.

By the numbers:

Haize Labs received a $100M post-money valuation, reflecting investor confidence in the growing need for AI safety testing.
Cascade is reported to deliver 38x faster attack generation with a 4x reduction in GPU memory usage.
Verdict showed a 14.5% improvement over GPT-4o on hallucination benchmarks.


Context

AI agent safety testing sits in a different control plane from authentication and authorisation. The primary issue is behavioural safety, meaning whether a model or agent can be pushed into unsafe, misleading, or policy-breaking outputs even after the user or workload has already passed access controls.

That distinction matters for NHI and agentic AI programmes because identity controls decide who gets in, while red-teaming and validation decide how the system behaves once it is already running. For teams treating AI as just another application layer, the failure is assuming login security can substitute for runtime behaviour assurance.

Key questions

Q: How should security teams govern AI agents that can produce unsafe outputs after login?

A: Security teams should govern AI agents with two separate controls: identity access and behavioural assurance. Authentication, SSO, RBAC, and provisioning decide who can use the system. Automated red-teaming and monitoring decide whether the system behaves safely once it is used. Both are required because a correctly authenticated agent can still generate unsafe, misleading, or policy-breaking outcomes.

Q: Why do AI agents create risk even when identity controls are in place?

A: AI agents create risk because identity controls only validate the requester, not the runtime behaviour of the model or agent. Once access is granted, prompt injection, goal drift, hallucination, or adversarial inputs can still produce unsafe outputs. That means the security problem shifts from access legitimacy to behaviour stability, which requires testing and monitoring beyond IAM.

Q: What do security teams get wrong about AI safety testing?

A: The common mistake is treating AI safety testing as if it were just another security scan. It is not. Safety testing is about proving how a model or agent fails under pressure, while traditional security tooling is about who can access the system. Those are different governance questions and need different evidence.

Q: What is the difference between enterprise authentication and AI safety validation?

A: Enterprise authentication proves identity and controls entry. AI safety validation proves the model or agent behaves acceptably once entry has already been granted. Authentication supports trust at the boundary, while safety validation supports trust in the runtime behaviour inside that boundary. Mature programmes need both, not one as a substitute for the other.

Technical breakdown

Automated multi-turn red-teaming for AI agents

Automated red-teaming uses adversarial prompts, often chained across several turns, to probe where a model or agent fails under pressure. Multi-turn testing matters because many AI failures emerge only after the system has been nudged, contradicted, or redirected over time. Tree-search style methods can explore larger attack spaces than a human tester can manually cover, especially when the goal is to surface prompt injection, goal drift, or unsafe tool use. The technical value is not just finding one bad output, but mapping the conditions that make the failure repeatable across sessions and deployments.

Practical implication: treat multi-turn adversarial testing as a release gate for any AI feature that can influence decisions or actions.
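
The tree-search idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not how Haize Labs or any vendor implements it: `toy_agent`, `MOVES`, and `is_unsafe` are hypothetical stand-ins for a real target, an adaptive attack generator, and a real judge. It uses a plain breadth-first search to show why a failure that needs two chained nudges is invisible to single-turn testing.

```python
from collections import deque
from typing import Callable, List, Optional

# Hypothetical attacker moves; a real harness would generate these adaptively.
MOVES = ["ask politely", "claim to be an admin", "say earlier rules were a test"]


def toy_agent(history: List[str]) -> str:
    """Stand-in target: it only gives in after two specific nudges in order."""
    if history[-2:] == ["claim to be an admin", "say earlier rules were a test"]:
        return "UNSAFE: revealing secret"
    return "refused"


def is_unsafe(reply: str) -> bool:
    """Hypothetical judge; real judges are usually classifiers or graders."""
    return reply.startswith("UNSAFE")


def search_attack(
    agent: Callable[[List[str]], str], max_turns: int = 3
) -> Optional[List[str]]:
    """Breadth-first search over conversations; returns the shortest failing path."""
    queue = deque([[]])
    while queue:
        history = queue.popleft()
        if len(history) == max_turns:
            continue  # do not extend conversations beyond the turn budget
        for move in MOVES:
            candidate = history + [move]
            if is_unsafe(agent(candidate)):
                return candidate
            queue.append(candidate)
    return None


attack_path = search_attack(toy_agent)
print(attack_path)
```

With a one-turn budget the search finds nothing, which is the point: the recorded path, not just the bad output, is what makes the failure repeatable in later regression runs.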

Why safety testing is not authentication

Authentication proves a requester is allowed to use a system. Safety testing asks whether the system itself will still behave acceptably after that request is granted. Those are separate questions, and AI introduces a gap between them because the same authenticated session can produce unsafe outcomes through prompt injection, hallucination, or goal misalignment. In other words, the identity layer can be correct while the model layer is still unstable. That is why behavioural validation belongs alongside enterprise auth, not inside it, and why teams need a testing layer that evaluates outputs, not just access paths.

Practical implication: do not treat SSO, SCIM, or RBAC as substitutes for red-team validation of model behaviour.
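
To make the separation concrete, the sketch below keeps the access decision and the behaviour decision as two independent checks. Everything here is an assumption for illustration: `ROLE_PERMISSIONS` stands in for RBAC, and `BLOCKED_MARKERS` is a deliberately naive substitute for real output validation.

```python
from dataclasses import dataclass

# Hypothetical role-to-permission map standing in for RBAC.
ROLE_PERMISSIONS = {"analyst": {"query_agent"}, "guest": set()}

# Hypothetical markers a behavioural check might flag in an agent's output.
BLOCKED_MARKERS = ("api_key=", "ignore previous instructions")


@dataclass
class Decision:
    access_allowed: bool
    behaviour_ok: bool

    @property
    def release(self) -> bool:
        # Both controls must pass; neither substitutes for the other.
        return self.access_allowed and self.behaviour_ok


def is_authorised(role: str, action: str) -> bool:
    """Identity-layer question: may this requester perform this action?"""
    return action in ROLE_PERMISSIONS.get(role, set())


def output_is_acceptable(output: str) -> bool:
    """Behaviour-layer question: is what the agent produced acceptable?"""
    lowered = output.lower()
    return not any(marker in lowered for marker in BLOCKED_MARKERS)


def evaluate(role: str, action: str, agent_output: str) -> Decision:
    return Decision(is_authorised(role, action), output_is_acceptable(agent_output))


decision = evaluate("analyst", "query_agent", "Sure, here it is: API_KEY=abc123")
print(decision.access_allowed, decision.behaviour_ok, decision.release)
```

In this example the analyst is correctly authorised, yet the response is still withheld because the output check fails: the identity layer was right and the behaviour layer was not.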

Continuous monitoring for model drift and adversarial exposure

Continuous monitoring extends red-teaming into production by watching for behavioural degradation after deployment. AI systems change as prompts, retrieval sources, model versions, and guardrails change, so a model that passed testing at launch can fail later under different conditions. Monitoring tools are most useful when they tie production observations back to specific failure modes discovered during testing, such as harmful output generation or leakage of sensitive information. That creates a feedback loop between evaluation and remediation, rather than relying on one-time approval before rollout.

Practical implication: connect monitoring alerts to the specific failure modes you have already proven in pre-production tests.
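
One way to wire that feedback loop, sketched under assumptions: keep a catalogue of failure modes already proven in testing, each with a detector, and tag production outputs against it. The mode names and detectors below are hypothetical and intentionally simple; real detectors would be far more robust than a regex or substring match.

```python
import re
from collections import Counter
from typing import Callable, Dict, List

# Hypothetical catalogue: failure modes proven in pre-production, each with a detector.
KNOWN_FAILURE_MODES: Dict[str, Callable[[str], bool]] = {
    # Flags text shaped like a US social security number.
    "sensitive_leak": lambda text: re.search(r"\b\d{3}-\d{2}-\d{4}\b", text) is not None,
    # Flags a destructive tool call by name.
    "unsafe_action": lambda text: "delete_all" in text,
}


def classify(output: str) -> List[str]:
    """Return the names of every known failure mode this output matches."""
    return [name for name, detect in KNOWN_FAILURE_MODES.items() if detect(output)]


def summarise(outputs: List[str]) -> Counter:
    """Count production observations per known failure mode."""
    counts: Counter = Counter()
    for output in outputs:
        counts.update(classify(output))
    return counts


production_outputs = [
    "Your balance is 42.",
    "The customer SSN is 123-45-6789.",
    "Calling tool delete_all on the staging bucket.",
    "Record 987-65-4321 was updated.",
]
alerts = summarise(production_outputs)
print(dict(alerts))
```

Because each alert carries the name of a failure mode from the test catalogue, a spike can be routed to the team that owns that remediation rather than landing as unlabelled noise.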

Moltbook AI agent keys breach: exposed 1.5M AI agent keys.
AI LLM hijack breach: attackers used stolen AWS access keys to hijack Anthropic models on Bedrock.

Read our 52 NHI Breaches Analysis report for a comprehensive view of breaches impacting Non-Human Identities including AI Agents.

NHI Mgmt Group analysis

Behavioural safety testing is becoming a parallel control to IAM, not a replacement for it. The article makes clear that the vendor sits in the validation layer, not the identity layer. That matters because enterprises are now building AI systems where the access decision may be correct while the runtime behaviour is still unsafe. Practitioners should stop treating access governance and behavioural assurance as the same control family.

AI agent safety creates a governance split between who can act and how the actor behaves. Traditional IAM answers the first question through authentication, authorisation, and lifecycle controls. Automated red-teaming answers the second by exposing unsafe outputs, goal drift, and adversarial susceptibility before production. The implication is that AI programmes need separate evidence for access legitimacy and behaviour stability.

Model-level testing is where NHI and agentic risk start to converge. As AI agents gain tool use and decision latitude, the security problem stops looking like static application security and starts looking like non-human identity governance with behavioural failure modes. That is why the article’s focus on agent testing matters for identity teams, not just ML teams. Practitioners should align testing, access, and monitoring around the same AI system boundary.

Operational AI safety depends on proving negative behaviour, not asserting policy intent. A system can be designed with guardrails and still fail under chained prompts or adversarial inputs. The article’s emphasis on automated red-teaming reinforces a broader field lesson: security claims about AI need empirical failure testing, because policy statements alone do not demonstrate safe operation. Practitioners should require evidence of what the model does when stressed, not only how it was configured.

Enterprise adoption will increasingly depend on whether behavioural testing and identity control are paired. The article implicitly shows that production AI buyers need both enterprise access governance and runtime safety evidence before they can trust a system. That creates a new expectation for platform teams: the identity boundary must be strong, and the agent behaviour inside that boundary must be continuously proven safe. Practitioners should plan for both controls as part of the same approval path.

From our research:
When AWS credentials are exposed publicly, attackers attempt access within an average of 17 minutes and as quickly as 9 minutes in some cases, according to LLMjacking: How Attackers Hijack AI Using Compromised NHIs.
DeepSeek accidentally embedded over 11,000 secrets in its training data and left a database exposed online, revealing more than one million sensitive records including chat histories, backend credentials, and API keys.
That exposure pattern makes 52 NHI Breaches Analysis useful for teams trying to separate access control failure from runtime behaviour failure in AI systems.

What this signals

AI safety programmes are moving toward dual control, where identity governance and behavioural assurance have to be evaluated together. A model or agent can be correctly authenticated and still be unsafe in production, which means access reviews alone will not satisfy operational risk owners. Teams should expect AI approval processes to look more like a combination of IAM, QA, and adversarial testing than like classic application onboarding.

With 48% of companies unable to track and audit the data their AI agents access, the governance gap is already structural. That blind spot changes how practitioners should plan evidence collection, because auditability is now part of the security baseline for agentic systems. For implementation teams, the question is not whether to test safety, but how to make testing and monitoring part of the release path.

Behavioural safety testing will increasingly sit beside NHI governance as a named control family. The more AI systems act like non-human identities with tool access and runtime decision-making, the more practitioners need evidence that both the account and the behaviour are governed. That points teams toward frameworks such as the OWASP Agentic AI Top 10 and the NIST AI Risk Management Framework.

For practitioners

Separate access approval from behaviour approval: Require distinct sign-off for enterprise identity controls and for AI safety validation. A successful login or provisioned service account should never be treated as evidence that the model or agent behaves safely under adversarial prompting.
Add multi-turn red-teaming to release gates: Test chained prompts, prompt injection, and goal drift before any model or agent reaches production. Focus on the specific workflows where unsafe output would create business, compliance, or customer harm, and block release until failure modes are documented.
Tie monitoring to known failure modes: Instrument production AI systems so alerts map back to the exact behaviours uncovered in pre-production testing, such as leakage, hallucination, or unsafe action selection. That makes runtime telemetry useful for remediation instead of just reporting noise.
Keep enterprise auth in the same rollout plan: Use SSO, directory sync, and admin workflows to control who can reach the AI system, then layer safety testing on top. Access governance without behavioural validation leaves a false sense of control, especially in customer-facing AI applications.

Key takeaways

AI safety testing and enterprise authentication solve different problems, so one cannot substitute for the other.
Automated red-teaming matters because AI failure often emerges only under chained prompts, adversarial inputs, or runtime drift.
Practitioners should govern AI systems with separate evidence for access, behaviour, and monitoring before production approval.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Agentic AI Top 10 and OWASP Non-Human Identity Top 10 address the attack and risk surface, while NIST AI RMF sets the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
OWASP Agentic AI Top 10	A1	Automated red-teaming maps directly to agent misuse and unsafe behaviour risks.
OWASP Non-Human Identity Top 10	NHI-03	AI agents using enterprise access need lifecycle and privilege controls.
NIST AI RMF		Behavioural assurance and governance fit the AI RMF risk-management model.

Treat AI agents as governed non-human identities and review access, scope, and monitoring together.

Key terms

Automated red-teaming: The use of adversarial test generation to find how an AI model or agent fails under pressure. It goes beyond manual review by systematically probing prompt injection, goal drift, unsafe outputs, and other repeatable behavioural weaknesses before production use.
Behavioural safety: The property of an AI system continuing to act within acceptable bounds when it is prompted, challenged, or manipulated in unexpected ways. In practice, it is validated through testing and monitoring, not assumed from the presence of access controls or policy text.
Runtime assurance: The evidence that a deployed system still behaves as intended after it has been released. For AI agents, this means watching for drift, unsafe outputs, and policy violations during live use, because pre-production approval alone does not guarantee safe operation.
Agentic risk: The security and governance exposure created when an AI system can make decisions, use tools, or take actions with limited human intervention. The risk is not only access to data, but the possibility that the system will pursue an unsafe path once it has access.

Deepen your knowledge

AI safety testing for LLMs and agents is covered in our NHI Foundation Level course, the industry's only accredited NHI security programme. If you are building AI governance around access control and runtime assurance together, it is worth exploring.

This post draws on content published by WorkOS: Haize Labs: AI Safety Testing Haize Labs for AI Agent Security: Features, Pricing, and Alternatives. Read the original.

NHIMG Editorial Note
Published by the NHIMG editorial team on 2025-11-07.
NHI Mgmt Group — the independent authority on Non-Human Identity, IAM, and Agentic AI security. nhimg.org