How should security teams block resource-draining prompts in LLM applications?

Security teams should combine hard output caps, prompt-cost scoring, and request throttling. The goal is to identify unusually expensive generations before they consume shared capacity. Controls work best when they are enforced at the interface, tied to authenticated identity, and tuned separately for chat, code, and retrieval use cases.

Why This Matters for Security Teams

Resource-draining prompts are not just noisy traffic. In LLM applications, a single request can trigger long token runs, repeated retrieval calls, and expensive tool use that degrades service for everyone else. The practical risk is denial of service by cost and capacity exhaustion, especially when prompts are anonymous, high-volume, or chained through agent workflows. Current guidance suggests treating prompt abuse as both an application-layer abuse case and an identity problem.

That matters because the blast radius often extends beyond the model endpoint. Shared queues, retrieval indexes, and downstream APIs can all be pulled into one expensive request path. NHI Management Group's AI Agents: The New Attack Surface report shows how quickly autonomous workloads can create visibility gaps, while the OWASP Agentic AI Top 10 and the NIST AI Risk Management Framework both reinforce the need for runtime controls, not just policy statements. In practice, many security teams discover runaway costs only after the monthly bill spikes or the service starts timing out for legitimate users.

How It Works in Practice

Blocking resource-draining prompts works best as a layered control, not a single filter. First, set hard output caps so a request cannot exceed a maximum token budget. Second, score prompts before execution using features such as unusually long context windows, repeated instruction patterns, broad retrieval scope, or requests that are likely to trigger tool cascades. Third, throttle by authenticated identity, tenant, and request class so one user or workload cannot monopolise capacity.
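The first two layers can be sketched as a pre-execution cost estimate. This is a minimal illustration, not a reference implementation: the PromptRequest fields, the HARD_OUTPUT_CAP value, and the weights are all hypothetical and would need tuning against real usage data.

```python
from dataclasses import dataclass


@dataclass
class PromptRequest:
    prompt_tokens: int        # tokens in the prompt plus attached context
    max_output_tokens: int    # output size the caller asked for
    retrieval_docs: int       # documents the request wants retrieved
    tool_calls_allowed: int   # tool invocations the request may trigger


# Hypothetical cap and weights, chosen only for illustration.
HARD_OUTPUT_CAP = 1024
WEIGHTS = {"prompt": 1.0, "output": 3.0, "retrieval": 200.0, "tool": 500.0}


def clamp_output(req: PromptRequest) -> int:
    """Apply the hard output cap regardless of what the caller requested."""
    return min(req.max_output_tokens, HARD_OUTPUT_CAP)


def estimate_cost(req: PromptRequest) -> float:
    """Return a unitless cost estimate computed before generation starts."""
    return (
        WEIGHTS["prompt"] * req.prompt_tokens
        + WEIGHTS["output"] * clamp_output(req)
        + WEIGHTS["retrieval"] * req.retrieval_docs
        + WEIGHTS["tool"] * req.tool_calls_allowed
    )


cheap = PromptRequest(prompt_tokens=200, max_output_tokens=256,
                      retrieval_docs=0, tool_calls_allowed=0)
heavy = PromptRequest(prompt_tokens=6000, max_output_tokens=8000,
                      retrieval_docs=20, tool_calls_allowed=5)

print(estimate_cost(cheap))  # 968.0
print(estimate_cost(heavy))  # 15572.0
```

The heavy request scores roughly sixteen times higher even though its output is capped, because retrieval scope and tool use are weighted separately from token counts.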

For most teams, the useful pattern is to apply admission control before generation starts. That means tying each request to an identity, assigning a cost estimate, and deciding whether to allow, delay, downgrade, or deny it. This is where security and reliability meet: a prompt that is acceptable in chat may be too expensive in code synthesis or retrieval-augmented generation. NHI Management Group's OWASP NHI Top 10 is relevant here because expensive prompts often arrive through the same identity and secret paths that support broader agent abuse.
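A minimal sketch of that admission decision follows. The threshold numbers, the request class names, and the rule that unauthenticated requests are denied outright are assumptions made for illustration; the cost value is assumed to come from whatever estimator the team already uses.

```python
from enum import Enum
from typing import Optional


class Decision(Enum):
    ALLOW = "allow"
    DELAY = "delay"          # queue the request behind cheaper traffic
    DOWNGRADE = "downgrade"  # e.g. a smaller model or a tighter output cap
    DENY = "deny"


# Hypothetical (allow, delay, downgrade) cost ceilings per request class.
THRESHOLDS = {
    "chat": (2_000, 5_000, 10_000),
    "code": (8_000, 15_000, 30_000),
    "retrieval": (5_000, 12_000, 25_000),
}


def admit(identity: Optional[str], request_class: str, cost: float) -> Decision:
    """Decide what to do with a request before any generation happens."""
    if identity is None:
        return Decision.DENY  # assumption: no anonymous access
    if request_class not in THRESHOLDS:
        return Decision.DENY  # unknown classes fail closed
    allow_max, delay_max, downgrade_max = THRESHOLDS[request_class]
    if cost <= allow_max:
        return Decision.ALLOW
    if cost <= delay_max:
        return Decision.DELAY
    if cost <= downgrade_max:
        return Decision.DOWNGRADE
    return Decision.DENY


# The same estimated cost is treated differently per request class.
print(admit("svc-a", "chat", 6_000))  # Decision.DOWNGRADE
print(admit("svc-a", "code", 6_000))  # Decision.ALLOW
```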

Operationally, teams should combine:

per-request token ceilings and response truncation
rate limits by user, service account, and tenant
prompt-cost scoring tuned separately for chat, code, and retrieval
queue isolation for premium, internal, and public workloads
logging that preserves the prompt shape, identity, and decision outcome
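The rate-limit item in the list above can be sketched as a token bucket keyed by tenant and identity. This is one common approach rather than a prescribed one; the class names, capacity, and refill rate are hypothetical, and a production deployment would normally keep bucket state in a shared store rather than in process memory.

```python
import time


class TokenBucket:
    """Token bucket whose balance refills continuously up to a fixed capacity."""

    def __init__(self, capacity: float, refill_per_sec: float, now: float):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = capacity
        self.updated = now

    def try_consume(self, cost: float, now: float) -> bool:
        # Refill for the time elapsed since the last call, then try to spend.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated = now
        if cost <= self.tokens:
            self.tokens -= cost
            return True
        return False


class IdentityThrottle:
    """Keeps one bucket per (tenant, identity) so callers cannot starve each other."""

    def __init__(self, capacity: float, refill_per_sec: float, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self._clock = clock
        self._buckets = {}

    def allow(self, tenant: str, identity: str, cost: float = 1.0) -> bool:
        now = self._clock()
        key = (tenant, identity)
        if key not in self._buckets:
            self._buckets[key] = TokenBucket(self.capacity, self.refill_per_sec, now)
        return self._buckets[key].try_consume(cost, now)


throttle = IdentityThrottle(capacity=2, refill_per_sec=1)
results = [throttle.allow("acme", "svc-a") for _ in range(3)]
# The third back-to-back call is expected to be rejected once the bucket is empty.
```

Passing the estimated request cost as the cost argument, instead of a flat 1.0, lets expensive prompts drain an identity's allowance faster than cheap ones.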

Where possible, pair these controls with policy-as-code so thresholds can be adjusted without redeploying the application. The NIST AI 600-1 Generative AI Profile and CSA MAESTRO agentic AI threat modeling framework both support runtime governance over static trust assumptions. These controls tend to break down when prompts can trigger unbounded retrieval fan-out or external tool recursion because cost becomes non-linear and difficult to predict from the prompt text alone.
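As a rough illustration of the policy-as-code idea, thresholds can live in a data file that is validated and reloaded at runtime. The JSON layout, field names, and limit values below are hypothetical; a real deployment might use a dedicated policy engine instead.

```python
import json

# Hypothetical policy document; in practice this would be read from a file
# or a configuration service rather than embedded in the application.
POLICY_JSON = """
{
  "chat":      {"max_output_tokens": 1024, "max_cost": 5000},
  "code":      {"max_output_tokens": 4096, "max_cost": 30000},
  "retrieval": {"max_output_tokens": 2048, "max_cost": 25000, "max_retrieval_docs": 20}
}
"""


def load_policy(text: str) -> dict:
    """Parse a policy document and reject non-positive or non-integer limits."""
    policy = json.loads(text)
    for request_class, limits in policy.items():
        if not isinstance(limits, dict):
            raise ValueError(f"{request_class}: limits must be an object")
        for name, value in limits.items():
            is_int = isinstance(value, int) and not isinstance(value, bool)
            if not is_int or value <= 0:
                raise ValueError(f"{request_class}.{name} must be a positive integer")
    return policy


policy = load_policy(POLICY_JSON)
print(policy["retrieval"]["max_retrieval_docs"])  # 20
```

An explicit cap such as max_retrieval_docs is one way to bound retrieval fan-out that cannot be predicted from the prompt text alone.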

Common Variations and Edge Cases

Tighter prompt controls often increase user friction and false positives, requiring organisations to balance abuse prevention against legitimate high-cost workloads. That tradeoff is most visible in research, code generation, and multi-document retrieval, where expensive prompts can be valid and time-sensitive. There is no universal standard for this yet, so best practice is evolving rather than fixed.

One common edge case is internal automation. A trusted service account can still generate abusive load if it is compromised or misconfigured, so identity alone is not enough. Another is agentic workflows, where one prompt can spawn follow-on calls that multiply cost across tools and services. For those environments, request scoring should include downstream actions, not just token counts. NHI Management Group’s Analysis of Claude Code Security is a useful reference point for how code-oriented assistants expand the control surface, and the Anthropic AI-orchestrated cyber espionage campaign report illustrates how automation can compound risk when actions are chained.
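One way to account for downstream actions is a budget object that every step of a workflow must charge before it runs. The sketch below is an assumption about how that could look; WorkflowBudget, BudgetExceeded, and the limits are hypothetical names and values.

```python
class BudgetExceeded(Exception):
    """Raised when a workflow step would exceed its cost or depth limit."""


class WorkflowBudget:
    """Tracks cumulative cost across every call spawned by one originating prompt."""

    def __init__(self, max_cost: float, max_depth: int):
        self.max_cost = max_cost
        self.max_depth = max_depth
        self.spent = 0.0

    def charge(self, action: str, cost: float, depth: int) -> None:
        # depth counts how many hops this action is from the original prompt.
        if depth > self.max_depth:
            raise BudgetExceeded(f"{action}: depth {depth} exceeds {self.max_depth}")
        if self.spent + cost > self.max_cost:
            raise BudgetExceeded(f"{action}: cost would exceed {self.max_cost}")
        self.spent += cost


budget = WorkflowBudget(max_cost=100, max_depth=2)
budget.charge("llm_call", 40, depth=0)
budget.charge("search_tool", 50, depth=1)

try:
    budget.charge("llm_call", 20, depth=1)  # 90 + 20 would exceed 100
except BudgetExceeded as exc:
    print(f"blocked: {exc}")
```

Because the budget is shared by the whole chain, a compromised or misconfigured service account is stopped at the same ceiling as any other caller, and a rejected step does not consume budget.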

Teams should also be careful not to block based only on prompt content. Attackers can distribute load across many small requests, while legitimate users may submit one unusually large request. That is why current guidance suggests combining content signals, identity, rate history, and environment-specific thresholds instead of relying on a single heuristic.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

The OWASP Agentic AI Top 10 and CSA MAESTRO address the attack and risk surface, while the NIST AI RMF sets the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
OWASP Agentic AI Top 10	A1	Covers abuse through costly or runaway agentic requests and tool chains.
CSA MAESTRO	TRT-01	Addresses threat modeling for agent workflows and cost-amplifying control paths.
NIST AI RMF		Supports runtime risk controls and monitoring for generative AI abuse cases.

Add runtime cost limits and request gating before prompts can trigger expensive agent actions.
