Start by attributing every model call to a specific identity, session, or application, then enforce quotas and rate limits at the gateway. Add context caps so long conversations do not grow indefinitely, and route simple tasks to cheaper models. The control objective is to make spend visible and bounded before finance discovers the problem in the invoice.
Why This Matters for Security Teams
Runaway AI token spend is usually a symptom of weak identity and workload controls, not just an expensive model choice. When every prompt, tool call, and retrieval loop can incur cost, teams need visibility into which identity is generating usage and why. Without that attribution, finance only sees the overage after the fact, while security and platform teams are left guessing which service, agent, or session caused the surge.
This is why cost governance belongs alongside access governance. The same discipline used to control secrets exposure, as described in the Guide to the Secret Sprawl Challenge, applies here: if identities are reused, sessions are long-lived, and usage is not tied to a named workload, spend becomes both unpredictable and hard to investigate. Current guidance in the NIST Cybersecurity Framework 2.0 supports this kind of measurable control, even though it does not prescribe token budgeting specifically.
NHIMG research also shows how identity misuse becomes operationally visible only after damage starts, as seen in the Salesloft OAuth token breach, where compromised tokens were used to reach downstream systems. In practice, many security teams encounter AI overspend only after a shared API key, runaway agent loop, or misconfigured integration has already consumed budget at scale.
How It Works in Practice
The control objective is to make AI usage attributable, bounded, and interruptible. That starts with mapping every request to a specific user, workload, service account, or agent session, then enforcing quotas at the gateway or broker layer before calls reach the model. For agentic systems, static IAM alone is not enough because the workload can chain tool calls, retry aggressively, or branch into long-running reasoning loops that were never obvious at design time.
Practical controls usually combine several layers:
- Per-identity quotas for daily, hourly, or per-session token usage
- Rate limits for burst suppression and retry storms
- Context caps so prompts, memory, and retrieval results do not grow without bound
- Model routing so low-risk or repetitive tasks use cheaper models
- Alerting for abnormal spend patterns tied to a specific application or agent
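As a rough illustration of the first layer in the list above, the sketch below keeps a fixed-window token budget per identity and refuses unattributed requests. It is a minimal in-memory example, not a production gateway: the names `TokenQuota`, `QuotaGateway`, and `admit`, and the idea of passing an estimated token count, are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class TokenQuota:
    """Fixed-window token budget for one identity (hypothetical helper)."""
    limit: int              # maximum tokens allowed per window
    window_seconds: float   # window length, e.g. 3600 for an hourly quota
    window_start: float = 0.0
    used: int = 0

    def try_consume(self, tokens: int, now: float) -> bool:
        # Start a fresh window once the current one has expired.
        if now - self.window_start >= self.window_seconds:
            self.window_start = now
            self.used = 0
        # Reject before the call reaches the model if it would exceed the budget.
        if self.used + tokens > self.limit:
            return False
        self.used += tokens
        return True


class QuotaGateway:
    """Tracks one quota per originating identity (illustrative only)."""

    def __init__(self, limit: int, window_seconds: float) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self._quotas = {}  # identity -> TokenQuota

    def admit(self, identity: str, estimated_tokens: int, now: float) -> bool:
        # Requests that cannot be attributed to an identity are refused outright.
        if not identity:
            return False
        if identity not in self._quotas:
            self._quotas[identity] = TokenQuota(
                self.limit, self.window_seconds, window_start=now
            )
        return self._quotas[identity].try_consume(estimated_tokens, now)
```

A real deployment would more likely keep counters in a shared store and reconcile them against the token counts the provider reports, since a pre-call estimate can undercount actual usage.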
Security teams should also treat cost policy as policy-as-code. In mature environments, each request is evaluated in real time against context such as identity, environment, task type, and risk tier. This aligns with the broader direction described in LLMjacking: How Attackers Hijack AI Using Compromised NHIs, where exposed credentials become a fast path to abuse. The operational lesson is that spend controls and abuse controls are the same control plane when access is machine-driven.
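One way to picture cost policy as code is a small, default-deny decision function evaluated per request. The sketch below is an assumption-laden example: the `RequestContext` fields mirror the context items named above, while the `POLICY` table, its keys, and its token ceilings are invented purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestContext:
    identity: str      # user, workload, service account, or agent session
    environment: str   # e.g. "prod" or "dev"
    task_type: str     # e.g. "summarisation"
    risk_tier: str     # e.g. "low" or "high"


@dataclass(frozen=True)
class Decision:
    allowed: bool
    max_tokens: int
    reason: str


# Hypothetical policy table: (environment, task_type, risk_tier) -> token ceiling.
POLICY = {
    ("prod", "summarisation", "low"): 4000,
    ("prod", "code-review", "high"): 1000,
    ("dev", "summarisation", "low"): 2000,
}


def evaluate(ctx: RequestContext, requested_tokens: int) -> Decision:
    """Return an allow/deny decision for one request (illustrative only)."""
    if not ctx.identity:
        return Decision(False, 0, "missing identity")
    ceiling = POLICY.get((ctx.environment, ctx.task_type, ctx.risk_tier))
    if ceiling is None:
        # Default deny: contexts without an explicit rule are not served.
        return Decision(False, 0, "no policy for context")
    if requested_tokens > ceiling:
        return Decision(False, ceiling, "exceeds ceiling")
    return Decision(True, ceiling, "ok")
```

Keeping the rules in a plain data structure means they can be reviewed, versioned, and tested like any other policy change.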
Teams that already use centralized API gateways, service meshes, or model proxies can usually enforce these limits without redesigning the application. The harder part is eliminating shared credentials and orphaned service identities, because a single overused key can hide multiple consumers behind one billing line. These controls tend to break down in multi-tenant agent platforms where dozens of autonomous workloads share a common inference proxy and attribution is not preserved end to end.
Common Variations and Edge Cases
Tighter spend controls often increase operational friction, so organisations need to balance budget protection against developer velocity and user experience. That tradeoff becomes sharper in research teams, customer-facing copilots, and autonomous agent pipelines where legitimate usage can spike unpredictably. Best practice is evolving, and there is no universal standard for token budgeting yet.
One common edge case is shared infrastructure. If many apps call the same model endpoint through one gateway, quotas should still be enforced on the originating identity, not just the network source. Another is long-running agent loops, where a model may spend moderately on each step but run indefinitely unless there is a hard stop on session duration, tool recursion, or maximum reasoning depth.
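The hard stop described above can be sketched as a per-session guard that the orchestration loop calls after every step. This is a minimal example under stated assumptions: `SessionGuard` and `BudgetExceeded` are hypothetical names, and the step counter stands in for whatever recursion or reasoning-depth measure a given agent framework exposes.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when a session crosses one of its hard limits."""


@dataclass
class SessionGuard:
    max_steps: int       # proxy for tool recursion or reasoning depth
    max_tokens: int      # cumulative token ceiling for the session
    max_seconds: float   # wall-clock ceiling for the session
    started_at: float
    steps: int = 0
    tokens: int = 0

    def record_step(self, tokens_used: int, now: float) -> None:
        # Update counters first, then check every limit.
        self.steps += 1
        self.tokens += tokens_used
        if self.steps > self.max_steps:
            raise BudgetExceeded("step limit reached")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded("token limit reached")
        if now - self.started_at > self.max_seconds:
            raise BudgetExceeded("session duration limit reached")


def run_agent(guard: SessionGuard, step_costs, clock) -> int:
    """Run steps until done or a limit trips; returns completed step count."""
    completed = 0
    try:
        for cost in step_costs:
            guard.record_step(cost, clock())
            completed += 1
    except BudgetExceeded:
        pass  # a real system would also log and alert on the owning identity
    return completed
```

Each step here is individually cheap, which is the point: the guard bounds the total, not the per-step cost.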
For high-value workloads, some organisations add approval thresholds or spend escalation paths, but those controls are usually reserved for exceptional usage rather than normal operations. It is also wise to review whether a task truly needs a frontier model, since routing routine classification, summarisation, or extraction to a smaller model can materially reduce waste without weakening control. That approach is consistent with broader governance thinking in the Secret Sprawl Challenge, where visibility and lifecycle discipline matter more than one-off cleanup.
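The routing idea in the paragraph above can be illustrated with a very small lookup. The task categories come from the text, but the model names `small-model` and `frontier-model` are placeholders, and real routing would likely weigh risk tier and quality requirements as well.

```python
# Task types the text suggests are candidates for a smaller model.
ROUTINE_TASKS = {"classification", "summarisation", "extraction"}


def choose_model(
    task_type: str,
    small_model: str = "small-model",        # placeholder name
    frontier_model: str = "frontier-model",  # placeholder name
) -> str:
    """Pick a model tier from the task type (illustrative only)."""
    if task_type.strip().lower() in ROUTINE_TASKS:
        return small_model
    return frontier_model
```

For example, `choose_model("extraction")` returns the small-model placeholder, while an unrecognised task type falls through to the frontier tier.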
Overspend is most difficult to contain in environments with shared API keys, weak session attribution, and autonomous retries triggered by upstream failures, because the same problem that drives cost up also obscures who caused it.
Standards & Framework Alignment
This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.
OWASP Agentic AI Top 10 and CSA MAESTRO address the attack and risk surface, while NIST AI RMF sets the governance and control requirements practitioners need to meet.
| Framework | Control / Reference | Relevance |
|---|---|---|
| OWASP Agentic AI Top 10 | A2 | Agentic systems can loop, retry, and chain tool calls, driving uncontrolled token usage. |
| CSA MAESTRO | AI-04 | MAESTRO addresses governance and runtime controls for autonomous AI workloads. |
| NIST AI RMF | | AI RMF supports measurable governance for AI usage, risk, and accountability. |
Enforce per-agent quotas, attribution, and approval gates in the model gateway or orchestration layer.