What do security teams get wrong about AI safety testing?

Why Security Teams Misread AI Safety Testing

Security teams often approach AI safety testing as if it were a vulnerability scan, but the actual question is broader: can the model or agent be pushed into unsafe, misleading, or policy-breaking behavior under realistic pressure? That means testing must examine prompt injection, jailbreak resistance, tool misuse, policy evasion, and harmful output patterns, not just access control. Traditional controls like NIST Cybersecurity Framework 2.0 remain useful, but they do not replace behavioral testing.

This is where teams often lose coverage. A system can pass identity checks and still fail safety testing if the model can be manipulated through untrusted content, indirect prompt injection, or compromised context. NHIMG research on the DeepSeek breach and the Microsoft Azure OpenAI service breach shows how quickly AI risk shifts from simple access questions to misuse of the model surface itself. In practice, many security teams discover safety failures only after a model has already been embedded into production workflows and exposed to real user inputs.

How Safety Testing Actually Works

Effective AI safety testing starts with a threat model for the model or agent’s behavior. The goal is to prove how the system responds when it encounters adversarial prompts, conflicting instructions, unsafe tool requests, and context contamination. That is different from verifying whether a user is allowed to log in. Current guidance suggests combining red teaming, automated evals, and policy-based review so that tests reflect the actual ways a model can fail.

Practitioners should test across several layers:

Prompt and instruction hierarchy: whether the model follows system policy over user content.

Tool use: whether the model can be induced to call APIs, retrieve data, or take actions it should not.

Context boundaries: whether retrieved content, files, or chat history can override safety constraints.

Output handling: whether unsafe content is blocked, transformed, or escalated as intended.

Regression testing: whether fixes remain effective after model, prompt, or tool changes.

For agentic systems, this becomes harder because the risk is not only what the model says, but what the agent does. A model that can browse, write files, trigger workflows, or chain tools needs runtime guardrails and continuous testing, not a one-time review. NHI governance research at The State of Non-Human Identity Security shows how identity and permission mistakes compound once machine-driven execution is involved. Best practice is evolving toward context-aware testing, policy-as-code, and evidence that the system behaves safely under realistic operational conditions. These controls tend to break down when safety testing is limited to static prompts and ignores the agent’s live toolchain, because the failure emerges only during multi-step execution.

Where the Test Strategy Usually Breaks Down

Tighter safety testing often increases build time and review overhead, requiring organisations to balance release speed against confidence in model behavior. That tradeoff is real, especially when product teams want to ship quickly and security teams are asked to validate every prompt path. There is no universal standard for this yet, so teams should be explicit about whether they are testing the model, the prompt, the agent, or the surrounding controls.

Common edge cases include vendor-hosted models, retrieval-augmented generation, and multi-agent workflows. In those environments, a test that looks good in a sandbox may fail once external content, third-party tools, or chained agents enter the loop. Current guidance suggests treating safety testing as an ongoing control rather than a gate at launch. One practical way to do that is to keep a small, repeatable suite of adversarial tests that runs whenever prompts, policies, model versions, or tools change. The biggest mistake is assuming that a secure deployment is automatically a safe one, because access control can be correct while behavior remains unsafe. That gap is especially visible when teams validate permissions but never validate how the system handles deceptive or malformed inputs.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Agentic AI Top 10 and CSA MAESTRO address the attack and risk surface, while NIST AI RMF set the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
OWASP Agentic AI Top 10		Agent safety testing must cover prompt injection, tool misuse, and unsafe outputs.
CSA MAESTRO		MAESTRO addresses governing agent behavior across tools, context, and execution paths.
NIST AI RMF		AI RMF governs evaluation of model risk, failure modes, and human oversight.

Build red-team tests for adversarial prompts, tool abuse, and policy evasion before production release.

#1 Authority in NHI Education, Research and Advisory, empowering organizations to tackle the critical risks posed by Non-Human Identities (NHIs), including AI Agents.

What do security teams get wrong about AI safety testing?

Why Security Teams Misread AI Safety Testing

How Safety Testing Actually Works

Where the Test Strategy Usually Breaks Down

Standards & Framework Alignment

Related resources from NHI Mgmt Group