Agentic AI & Autonomous Identity

Why can a clean AI model still produce unsafe tool calls?

By NHI Mgmt Group Editorial Team | Updated July 5, 2026 | Domain: Agentic AI & Autonomous Identity

A model can be clean while its decoded output is altered. If a tokenizer changes the string attached to a token ID, the model still predicts the same ID but the executor receives different text. That creates a gap between model behaviour and action behaviour, which is why output validation must happen after decoding and before execution.

Why This Matters for Security Teams

A clean model does not guarantee safe action. The risk appears when a tool-using agent decodes model tokens into executable text: the model’s prediction can remain unchanged while the decoded payload becomes operationally different. That is especially dangerous in agentic workflows where a single malformed call can touch credentials, files, APIs, or admin functions. Current guidance from the NIST Cybersecurity Framework 2.0 still applies: outputs need controls, not just inputs. The same gap shows up in real-world NHI incidents, including the LLMjacking research, where compromised identities and exposed credentials enabled attacker action without changing the model itself. In practice, many security teams discover unsafe tool use only after a benign-looking prompt has already become a damaging execution path, not through validation they designed in advance.

Security teams often focus on prompt filters, model alignment, or jailbreak detection, but those controls do not solve the execution boundary. The critical question is whether the text that reaches the tool runner matches the policy that was evaluated. If a tokenizer, decoder, or post-processing layer changes the string associated with a token ID, the model may still be “correct” while the tool call becomes unsafe. This is why output validation must sit after decoding and before execution.
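The gap described above can be shown with a toy example. This is a minimal sketch, not a real tokenizer: the two vocabularies, the `decode` helper, and the token IDs are all hypothetical, chosen only to show that identical token IDs can decode to different strings when the ID-to-string mapping is changed.

```python
# Toy vocabularies (hypothetical): each token ID maps to a string fragment.
CLEAN_VOCAB = {0: "rm", 1: " -i", 2: " ./tmp/cache"}
# Same IDs, but the string attached to ID 1 has been altered.
TAMPERED_VOCAB = {**CLEAN_VOCAB, 1: " -rf"}


def decode(token_ids, vocab):
    """Join the string attached to each token ID, as a decoder would."""
    return "".join(vocab[i] for i in token_ids)


# The model predicts the same IDs in both cases.
model_output_ids = [0, 1, 2]

evaluated = decode(model_output_ids, CLEAN_VOCAB)     # text the policy reviewed
executed = decode(model_output_ids, TAMPERED_VOCAB)   # text the runner receives

print(evaluated)
print(executed)
print(evaluated == executed)
```

A check that inspects only `model_output_ids` sees no difference between the two runs, which is why the comparison has to be made on the decoded string.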

That boundary is becoming more important as agentic systems chain tools automatically. A harmless-sounding model output can become a shell command, API request, or secret retrieval action with real side effects. The State of Secrets in AppSec research shows how common secrets exposure and slow remediation already are, which makes any unsafe tool call more consequential. NIST guidance also reinforces that cybersecurity controls should be embedded into system design, not applied only at the perimeter.

Validate the final decoded text, not the token prediction alone.
Compare tool arguments against allowlists, schemas, and policy rules before execution.
Separate model generation from execution privileges so the model cannot directly act.
Log the exact decoded payload that was approved and the exact payload that was executed.

These controls tend to break down in event-driven agent pipelines where multiple services can rewrite, enrich, or auto-repair the tool payload before execution.
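One way to detect a rewrite between approval and execution is to bind the approval to a digest of the exact payload. The sketch below is illustrative only: `ApprovalLog`, its record format, and the `runner` callback are hypothetical names, and an in-memory list stands in for whatever audit store a real pipeline would use.

```python
import hashlib


def payload_digest(payload: str) -> str:
    """SHA-256 over the exact UTF-8 bytes of the decoded payload."""
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ApprovalLog:
    """Records what was approved and refuses to run anything else."""

    def __init__(self):
        self.records = []

    def approve(self, payload: str) -> str:
        digest = payload_digest(payload)
        self.records.append(
            {"event": "approved", "payload": payload, "sha256": digest}
        )
        return digest

    def execute(self, payload: str, approved_digest: str, runner):
        digest = payload_digest(payload)
        if digest != approved_digest:
            # An intermediate service rewrote, enriched, or repaired the payload.
            self.records.append(
                {"event": "blocked", "payload": payload, "sha256": digest}
            )
            raise PermissionError("payload changed after approval")
        self.records.append(
            {"event": "executed", "payload": payload, "sha256": digest}
        )
        return runner(payload)
```

Any service that alters the payload after approval changes the digest, so the mismatch is both blocked and recorded alongside the originally approved text.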

How It Works in Practice

The safest pattern is to treat the model as a suggestion engine and the executor as the enforcement point. The model generates candidate output, but a separate control layer checks the decoded string, validates structure, and decides whether a tool call is allowed. That means the security decision happens on the final text, not on the latent representation or token ID sequence.

For tool-using agents, the practical workflow is usually:

Decode the model output into the exact string the tool runner will consume.
Normalize that string so hidden encoding tricks, unsafe escapes, and ambiguous formatting are visible.
Validate the request against a schema, policy rule, or tool-specific allowlist.
Block or rewrite anything that requests privileged actions, broad data access, or unexpected parameter values.
Issue short-lived credentials only if the request is approved for that exact task.
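The workflow above can be sketched as a single gate function. Everything here is an assumption made for illustration: the tool names in `ALLOWED_TOOLS`, the `{"tool": ..., "args": ...}` call shape, and the `issue_credential` stub (which returns a random token with an expiry rather than calling a real credential service). The sketch blocks disallowed requests outright and does not attempt the rewrite option.

```python
import json
import secrets
import time
import unicodedata

# Hypothetical allowlist: tool name -> exact set of permitted argument names.
ALLOWED_TOOLS = {
    "read_file": {"path"},
    "http_get": {"url"},
}


def normalize(text: str) -> str:
    """Apply NFKC and refuse control or format characters (e.g. zero-width)."""
    text = unicodedata.normalize("NFKC", text)
    for ch in text:
        # Assumption: tool calls are compact, single-line JSON, so even
        # newlines and tabs (category Cc) are treated as suspicious here.
        if unicodedata.category(ch) in ("Cc", "Cf"):
            raise ValueError(f"hidden character U+{ord(ch):04X}")
    return text


def validate(text: str) -> dict:
    """Check structure, tool name, and argument names against the allowlist."""
    call = json.loads(text)
    if not isinstance(call, dict) or set(call) != {"tool", "args"}:
        raise ValueError("unexpected call structure")
    tool, args = call["tool"], call["args"]
    if not isinstance(tool, str) or tool not in ALLOWED_TOOLS:
        raise ValueError("tool not allowed")
    if not isinstance(args, dict) or set(args) != ALLOWED_TOOLS[tool]:
        raise ValueError("unexpected arguments")
    if not all(isinstance(v, str) for v in args.values()):
        raise ValueError("argument values must be strings")
    return call


def issue_credential(call: dict, ttl_seconds: int = 60) -> dict:
    """Stub: a short-lived token scoped to the single approved tool."""
    return {
        "token": secrets.token_urlsafe(16),
        "tool": call["tool"],
        "expires_at": time.time() + ttl_seconds,
    }


def gate(decoded_text: str):
    """Normalize, validate, then issue a credential only on approval."""
    call = validate(normalize(decoded_text))
    return call, issue_credential(call)
```

Because `gate` raises before `issue_credential` is reached, a rejected request never holds a credential, even briefly.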

This is where runtime policy matters more than static role design. A model can appear well-behaved in testing yet still emit an unsafe call when the decoded text changes after tokenization, templating, or downstream formatting. That is why current best practice is evolving toward execution-time controls, policy-as-code, and tightly scoped workload identity rather than assuming model outputs are inherently trustworthy. The NIST framework and the DeepSeek breach both underscore the same lesson: trusted behaviour must be enforced at the boundary where action happens.

These controls tend to break down when an agent is allowed to self-compose tool chains across multiple services because each handoff can introduce a new transformation step.

Common Variations and Edge Cases

Tighter output validation often increases latency and engineering overhead, requiring organisations to balance safety against operational complexity. There is no universal standard for every tool format yet, so teams need environment-specific controls rather than one generic filter.

One common edge case is auto-correction. Some systems “fix” malformed output before execution, which can accidentally turn a borderline request into a more dangerous one. Another is hidden tool syntax inside JSON, YAML, shell fragments, or structured prompts where the model seems to emit one meaning but the executor interprets another. Best practice is evolving, but the safest approach is to reject ambiguity rather than guess the intended action.
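Rejecting ambiguity can be as simple as parsing strictly and never repairing. The sketch below assumes tool calls arrive as JSON and uses the standard library `json` module; the function names are hypothetical. It refuses duplicate keys (where different parsers may disagree on which value wins), refuses the non-standard constants `NaN` and `Infinity`, and lets truncated or trailing input fail rather than fixing it.

```python
import json


def _reject_duplicates(pairs):
    """Fail if an object repeats a key instead of silently keeping one value."""
    keys = [key for key, _ in pairs]
    if len(keys) != len(set(keys)):
        raise ValueError("duplicate key in tool call")
    return dict(pairs)


def _reject_constant(name):
    """Fail on NaN, Infinity, and -Infinity, which are not valid JSON."""
    raise ValueError(f"non-standard JSON constant: {name}")


def parse_tool_call_strict(text: str) -> dict:
    """Parse exactly what was emitted; malformed input raises, never repairs."""
    return json.loads(
        text,
        object_pairs_hook=_reject_duplicates,
        parse_constant=_reject_constant,
    )
```

With the default `json.loads` behaviour, an object such as `{"path": "a.txt", "path": "/etc/shadow"}` parses without error and keeps the last value, which is the kind of silent guess this sketch is meant to avoid.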

Another failure mode appears in systems that share a single agent across read and write operations. If the same identity can inspect data, invoke tools, and modify state, then a decoded-output mismatch becomes a privilege escalation path. Security teams should keep tool scopes narrow, prefer short-lived credentials, and require re-validation for every state-changing action. In other words, the model can be clean while the action layer is unsafe, and that gap is what defenders must close.
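A minimal sketch of separating read and write identities follows. The tool names, scope labels, and the `revalidated` flag are hypothetical; in a real system the flag would be set only after the decoded payload has passed validation again immediately before the state-changing call.

```python
from dataclasses import dataclass

# Hypothetical tool groupings for illustration.
READ_TOOLS = {"list_files", "read_file"}
WRITE_TOOLS = {"write_file", "delete_file"}


@dataclass(frozen=True)
class Identity:
    name: str
    scopes: frozenset


def authorize(identity: Identity, tool: str, revalidated: bool = False) -> bool:
    """Allow reads by scope; allow writes only by scope plus re-validation."""
    if tool in READ_TOOLS:
        return "read" in identity.scopes
    if tool in WRITE_TOOLS:
        return "write" in identity.scopes and revalidated
    return False  # unknown tools are denied by default


# Two narrow identities instead of one agent that can do everything.
reader = Identity("agent-reader", frozenset({"read"}))
writer = Identity("agent-writer", frozenset({"write"}))
```

With this split, a decoded-output mismatch on the reader identity cannot modify state, and the writer identity cannot act unless the payload was re-validated for that specific action.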

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Agentic AI Top 10 and CSA MAESTRO address the attack and risk surface, while NIST AI RMF sets the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
OWASP Agentic AI Top 10	LLM-06	Unsafe tool calls are a core agentic execution risk.
CSA MAESTRO	A1	MAESTRO covers runtime control of autonomous agent actions.
NIST AI RMF		The AI RMF addresses managing AI system risks across the lifecycle.

Validate decoded tool output before execution and reject ambiguous or over-privileged actions.

NHIMG Editorial Note
Reviewed and updated by the NHIMG editorial team on July 5, 2026.
NHI Mgmt Group — the #1 independent authority on Non-Human Identity, IAM, and Agentic AI security. nhimg.org
