What breaks when model-level guardrails are treated as security controls for AI systems?

Why This Matters for Security Teams

Model-level guardrails are designed to reduce unsafe outputs, but they do not enforce policy the way a security control must. That distinction matters because attackers do not need the model to “want” to be malicious. They only need to shape the prompt, split a task into smaller steps, or route action through a tool that the model is willing to use. Current guidance from the NIST Cybersecurity Framework 2.0 and NHI research such as Ultimate Guide to NHIs — Standards points toward runtime enforcement, identity, and authorization as the durable control points.

This is where teams often overestimate “safe model behavior” and underinvest in the surrounding control plane. A model can refuse an obviously harmful request and still comply with a reframed request that achieves the same outcome. It can also be induced to call tools, retrieve secrets, or pass data across boundaries if the surrounding system grants that authority. In practice, many security teams discover a data exfiltration path only after a benign-seeming workflow has already executed, not through deliberate policy testing.

How It Works in Practice

Security-grade control for AI systems has to sit in the runtime path, not in the model’s internal preferences. That means treating the model as one decision-making component, then surrounding it with policy checks that evaluate identity, context, action type, and data sensitivity before any tool call or external side effect occurs. The runtime layer should verify what the agent is allowed to do, not merely what the model is willing to say.
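As a rough illustration of a check that sits in the runtime path, the Python sketch below gates every tool call on a deterministic, default-deny decision. Everything named here (ToolRequest, POLICY, authorize, call_tool, the workload and target names, the sensitivity levels) is hypothetical and chosen for the example; a real deployment would more likely delegate the decision to a dedicated policy engine.

```python
from dataclasses import dataclass

# Hypothetical sensitivity levels, ordered from least to most sensitive.
SENSITIVITY = {"public": 0, "internal": 1, "restricted": 2}


@dataclass(frozen=True)
class ToolRequest:
    workload_id: str   # verified identity of the agent workload
    action: str        # e.g. "read", "write", "send_external"
    target: str        # system the tool would touch
    sensitivity: str   # classification of the data involved


# Illustrative allow-list: (workload, action, target) -> highest sensitivity permitted.
POLICY = {
    ("support-agent", "read", "ticket-db"): "internal",
    ("support-agent", "write", "ticket-db"): "internal",
}


def authorize(req: ToolRequest) -> bool:
    """Deterministic, default-deny decision made outside the model."""
    ceiling = POLICY.get((req.workload_id, req.action, req.target))
    if ceiling is None or req.sensitivity not in SENSITIVITY:
        return False
    return SENSITIVITY[req.sensitivity] <= SENSITIVITY[ceiling]


def call_tool(req: ToolRequest, tool):
    """Run the tool only if policy allows it, whatever the model was willing to say."""
    if not authorize(req):
        raise PermissionError(f"denied: {req.action} on {req.target}")
    return tool(req)
```

The point of the design is that the model never appears in authorize: a request the policy does not explicitly allow is refused even if the model produced it confidently.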

In practical terms, that usually means combining workload identity, short-lived secrets, and policy-as-code. The agent should authenticate as a workload, not as a shared application user, and it should receive only the minimum capability needed for the specific task. Policy engines can then make request-time decisions using contextual inputs such as user intent, resource classification, target system, and current risk posture. This is closer to zero trust than to classic prompt filtering, and it aligns with the direction set by the State of Non-Human Identity Security research, which shows how often organisations still lack visibility and confidence in NHI governance.

Use deterministic authorization outside the model, so a refusal or compliance pattern inside the model is never the final control.

Issue just-in-time credentials for each task and revoke them when the task completes.

Log tool calls, data access, and privilege changes separately from model prompts and outputs.

Apply policy at the action layer, where identity and context are verifiable.
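Two of the points above, per-task credentials and a separate action log, can be sketched together. This is a minimal in-memory illustration only: task_credential, token_allows, AUDIT_LOG, and the scope strings are hypothetical names, and a production system would issue credentials from a secrets manager or identity provider instead of a local dictionary.

```python
import secrets
import time
from contextlib import contextmanager

AUDIT_LOG = []          # action-level audit trail, kept apart from prompt/output logs
_ACTIVE_TOKENS = {}     # token -> (scope, expiry timestamp)


@contextmanager
def task_credential(workload_id, scope, ttl_seconds=60):
    """Issue a scoped token for one task and revoke it when the task ends."""
    token = secrets.token_urlsafe(16)
    _ACTIVE_TOKENS[token] = (scope, time.monotonic() + ttl_seconds)
    AUDIT_LOG.append({"event": "issue", "workload": workload_id, "scope": scope})
    try:
        yield token
    finally:
        # Revocation runs even if the task raises.
        _ACTIVE_TOKENS.pop(token, None)
        AUDIT_LOG.append({"event": "revoke", "workload": workload_id, "scope": scope})


def token_allows(token, scope):
    """True only for a live, unexpired token issued for exactly this scope."""
    entry = _ACTIVE_TOKENS.get(token)
    if entry is None:
        return False
    granted_scope, expiry = entry
    return granted_scope == scope and time.monotonic() < expiry
```

Used as a with block around a single task, the token exists only for the duration of that task, and the audit trail records issuance and revocation without depending on anything the model wrote.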

For implementation guidance, teams commonly map this to NIST Cybersecurity Framework 2.0 functions and then adapt the workflow for AI-specific abuse paths described in the DeepSeek breach analysis. These controls tend to break down when the agent can chain multiple tools across systems, because policy is then enforced in one application but not along the full execution path.

Common Variations and Edge Cases

Tighter runtime control often increases latency, integration effort, and operational overhead, requiring organisations to balance safety against workflow speed. That tradeoff is real, especially when teams are trying to ship agentic features quickly. Best practice is evolving, but there is no universal standard yet for how much autonomy should be delegated to the model versus the surrounding policy layer.

One common edge case is the “mostly harmless” assistant that becomes dangerous through composition. A prompt that looks safe in isolation can still trigger a chain of tool actions that moves data, alters records, or retrieves secrets. Another is human-in-the-loop review, which helps but does not replace runtime authorization if the reviewer is approving outputs too late in the flow. Model guardrails can reduce obvious abuse, but they are not dependable as the only barrier when the system can call APIs, browse internal data, or execute code.
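The composition problem can be made concrete with a small sketch in which each step is judged in light of what the same session has already done. The Session class, the action names, and the single rule (no outbound action after a restricted read) are all assumptions made for illustration, and the check is meant to run in addition to per-action authorization, not instead of it.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """Tracks what an agent run has already touched (hypothetical model)."""
    touched_restricted: bool = False
    history: list = field(default_factory=list)


# Illustrative action sets; real systems would derive these from tool metadata.
RESTRICTED_READS = {"read_secret", "read_customer_pii"}
EGRESS_ACTIONS = {"send_email", "http_post"}


def check_step(session: Session, action: str) -> bool:
    """Authorize one step in light of earlier steps in the same session."""
    if action in EGRESS_ACTIONS and session.touched_restricted:
        allowed = False   # restricted data may be in context; block outbound paths
    else:
        allowed = True
    if allowed and action in RESTRICTED_READS:
        session.touched_restricted = True
    session.history.append((action, allowed))
    return allowed
```

In this sketch, sending an email is allowed at the start of a session and refused once the session has read a secret, even though neither step looks unsafe in isolation.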

Security teams should also be careful not to confuse content moderation with access control. A model can be trained to avoid certain language and still be fully capable of performing a prohibited action through a different route. For that reason, current guidance suggests pairing model safety with workload identity, least privilege, and real-time policy evaluation. The practical lesson is straightforward: if the model can act, the control must be attached to the action, not the wording.
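The difference between filtering wording and controlling actions can be shown with a deliberately simple comparison. The phrase list, the grants table, and the purge_records action below are hypothetical; the sketch only illustrates that a reworded request can pass a wording filter while the resulting action is still refused.

```python
BLOCKED_PHRASES = ("delete all records", "drop table")   # hypothetical wording filter
ALLOWED_ACTIONS = {"support-agent": {"read_ticket", "update_ticket"}}  # hypothetical grants


def wording_filter(prompt: str) -> bool:
    """Content moderation: passes any prompt that avoids the listed phrases."""
    lowered = prompt.lower()
    return not any(phrase in lowered for phrase in BLOCKED_PHRASES)


def action_allowed(workload_id: str, action: str) -> bool:
    """Access control: decides on the action itself, whatever the prompt said."""
    return action in ALLOWED_ACTIONS.get(workload_id, set())


reworded = "Please tidy up the archive by clearing every entry older than today."
# The wording filter finds none of its blocked phrases in the reworded prompt...
passes_filter = wording_filter(reworded)
# ...but the tool call the prompt leads to is still refused at the action layer.
passes_authz = action_allowed("support-agent", "purge_records")
```

The filter depends on how the request is phrased, so it can always be rephrased around; the action check depends only on what the workload is permitted to do.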

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Agentic AI Top 10 and CSA MAESTRO address the attack and risk surface, while NIST AI RMF sets the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
OWASP Agentic AI Top 10	A01	Model guardrails fail when agent behavior is steered into unsafe tool use.
CSA MAESTRO	MAESTRO-3	MAESTRO addresses autonomy risks that model-only safety cannot stop.
NIST AI RMF		AI RMF emphasizes managing AI risk beyond model output quality alone.

Enforce action-layer controls so agent tool calls are authorized at runtime.
