
How should security teams test LLMs that can access tools and external data?

Security teams should test LLMs by simulating real runtime abuse, not only prompt injection. That means validating how the model handles hidden instructions, retrieval poisoning, tool calls, and unsafe output escalation. The right question is whether the system can be constrained when context changes, because that is where the operational risk emerges.

Why This Matters for Security Teams

Testing an LLM that can call tools or query external data is not the same as testing a chat interface. Once the model can fetch records, trigger workflows, or chain tool calls, the security problem shifts from prompt quality to runtime control. That is why guidance from the OWASP Agentic AI Top 10 and the NIST AI Risk Management Framework emphasizes context, authorization, and abuse resistance rather than simple output filtering.

Practitioners should assume attackers will not stop at obvious prompt injection. They will poison retrieval sources, hide instructions inside documents, coerce tool usage, and search for paths that turn a harmless answer into a privileged action. NHIMG's research on the AI LLM hijack breach shows how quickly compromised identities and weak runtime boundaries can turn an AI system into an access path, not just a text generator. In practice, many security teams discover this only after the model has already retrieved, transformed, or exposed data that was never meant to leave its original control plane.

How It Works in Practice

Effective testing starts by building scenarios that mirror the full runtime path: user prompt, hidden system instructions, retrieval layer, policy checks, tool invocation, and post-tool response handling. The goal is to observe where the model can be manipulated, where the application trusts model output too much, and where privilege expands across steps. Current best practice is to test the whole chain, not just the first prompt.
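One way to make whole-chain testing concrete is to record a trace of each scenario and flag the step where privilege rises. The sketch below is illustrative only: the stage names follow the runtime path listed above, while the `Trace` class, the numeric privilege levels, and the sample events are assumptions made for this example rather than part of any framework.

```python
from dataclasses import dataclass, field

# Stage names mirror the runtime path described in the text.
STAGES = (
    "user_prompt",
    "system_instructions",
    "retrieval",
    "policy_check",
    "tool_invocation",
    "response_handling",
)


@dataclass
class Trace:
    """Hypothetical per-scenario trace: (stage, detail, privilege level)."""
    events: list = field(default_factory=list)

    def record(self, stage: str, detail: str, privilege: int) -> None:
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        self.events.append((stage, detail, privilege))

    def privilege_escalations(self) -> list:
        """Return (from_stage, to_stage) pairs where privilege increased."""
        pairs = []
        for prev, cur in zip(self.events, self.events[1:]):
            if cur[2] > prev[2]:
                pairs.append((prev[0], cur[0]))
        return pairs


# Illustrative scenario: a poisoned document leads to a privileged tool call.
trace = Trace()
trace.record("user_prompt", "summarise ticket", 1)
trace.record("retrieval", "document contains hidden instruction", 1)
trace.record("tool_invocation", "send_email invoked", 3)
trace.record("response_handling", "tool result returned to user", 3)

escalations = trace.privilege_escalations()
print(escalations)
```

A trace like this keeps the finding tied to a specific step in the chain, which is harder to see when only the final answer is reviewed.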

A practical test plan should include:

  • Prompt injection against both visible and hidden instructions, including attempts to override policy or induce unsafe disclosure.
  • Retrieval poisoning using malicious or misleading documents, especially where the model ranks external data above internal policy.
  • Tool-call abuse such as unauthorized API calls, parameter tampering, over-broad search, and chained actions across multiple tools.
  • Data exfiltration tests that verify whether sensitive content can be echoed, summarized, transformed, or re-sent through a downstream tool.
  • Authority checks that confirm the model cannot exceed the scope of the authenticated user or workload identity.
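The tool-call abuse and authority checks in the plan above can be sketched as a small test suite that replays adversarial tool calls against an authorization gate. Everything named here (`Identity`, `ToolCall`, `authorize`, the tool names, and the result limit) is hypothetical and stands in for whatever policy layer the application actually uses.

```python
from dataclasses import dataclass


@dataclass
class Identity:
    """Hypothetical authenticated user or workload identity."""
    name: str
    allowed_tools: frozenset
    max_results: int


@dataclass
class ToolCall:
    """A tool call proposed by the model."""
    tool: str
    params: dict


def authorize(identity: Identity, call: ToolCall):
    """Return (allowed, reason) for a proposed tool call."""
    if call.tool not in identity.allowed_tools:
        return False, "tool not in identity scope"
    limit = call.params.get("limit")
    if limit is not None and limit > identity.max_results:
        return False, "over-broad search"
    return True, "ok"


# Adversarial and legitimate calls the model might emit (illustrative).
CALLS = [
    ToolCall("delete_record", {"id": 7}),                         # unauthorized API call
    ToolCall("search_records", {"query": "*", "limit": 10_000}),  # over-broad search
    ToolCall("search_records", {"query": "invoice 42", "limit": 5}),
]

analyst = Identity("analyst", frozenset({"search_records"}), max_results=50)
results = [(call.tool, *authorize(analyst, call)) for call in CALLS]
for row in results:
    print(row)
```

The design choice worth noting is that the gate evaluates the model's proposed call against the caller's identity, so the test fails on scope violations regardless of how the prompt was phrased.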

For runtime governance, security teams should pair application testing with identity and authorization controls. The OWASP Non-Human Identity Top 10 is useful here because tool-using LLMs often rely on secrets, tokens, and service identities that need tighter lifecycle control than human accounts. NHIMG’s Ultimate Guide to NHIs is also relevant when teams need to align secret management with workload behavior rather than application ownership alone.

These controls tend to break down when the LLM is connected to multiple external systems with inconsistent authorization models, because the model can exploit the weakest tool boundary even if the front-end prompt layer is well defended.

Common Variations and Edge Cases

Tighter tool governance often increases test complexity and slows iteration, so organisations must balance stronger containment against developer productivity and evaluation coverage. That tradeoff becomes more visible when the LLM depends on retrieval-augmented generation, long-running workflows, or delegated actions across several services.

There is not yet a universal standard for how much tool autonomy is acceptable, so current guidance suggests testing by risk tier. High-impact systems should be evaluated with realistic adversarial data, rate-limited credentials, and short-lived access tokens. Lower-risk systems may tolerate narrower test scope, but they still need checks for hidden instruction handling and output escalation.
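Risk-tiered testing can be expressed as a small table of requirements per tier. The sketch below is an assumption-laden illustration: the tier names, the `TierPlan` fields, and the 900-second token lifetime are invented for the example, and only the baseline checks (hidden instruction handling and output escalation) come directly from the guidance above.

```python
from dataclasses import dataclass
from typing import Optional

# Checks that every tier still needs, per the guidance above.
BASELINE = ("hidden_instruction_handling", "output_escalation")


@dataclass(frozen=True)
class TierPlan:
    """Hypothetical test requirements for one risk tier."""
    adversarial_data: bool
    rate_limited_credentials: bool
    max_token_ttl_seconds: Optional[int]  # None means no tier-specific limit
    required_checks: tuple


PLANS = {
    "high": TierPlan(
        adversarial_data=True,
        rate_limited_credentials=True,
        max_token_ttl_seconds=900,  # illustrative value, not a recommendation
        required_checks=BASELINE + (
            "retrieval_poisoning",
            "tool_call_abuse",
            "data_exfiltration",
            "authority_checks",
        ),
    ),
    "low": TierPlan(
        adversarial_data=False,
        rate_limited_credentials=False,
        max_token_ttl_seconds=None,
        required_checks=BASELINE,
    ),
}


def missing_checks(tier: str, completed: set) -> list:
    """List required checks for a tier that have not been completed."""
    return [c for c in PLANS[tier].required_checks if c not in completed]


print(missing_checks("low", {"output_escalation"}))
```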

Edge cases often appear when the model is “read only” in theory but can still trigger side effects through secondary tools, shared integrations, or callback handlers. They also appear when external data sources are trusted by default, even though the retrieval index may include unvetted content. For deeper context on this broader attack surface, NHIMG’s OWASP NHI Top 10 and the CSA MAESTRO agentic AI threat modeling framework help frame where autonomy, identity, and tool access intersect. The hardest failures usually show up when the system behaves safely in isolated prompts but fails once retrieval, permissions, and external actions are combined in production-like sequences.
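The "read only in theory" edge case can be checked by running a tool against a test double that records every mutation. The following is a minimal sketch under stated assumptions: `RecordingBackend`, `lookup_tool`, and `audit_read_only` are hypothetical names, and the callback stands in for any secondary integration or handler attached to a nominally read-only tool.

```python
class RecordingBackend:
    """Test double that records every mutating operation."""

    def __init__(self):
        self.mutations = []

    def read(self, key):
        return f"value-for-{key}"

    def write(self, key, value):
        self.mutations.append((key, value))


def lookup_tool(backend, key, on_result=None):
    """Nominally read-only tool; an optional callback runs on the result."""
    result = backend.read(key)
    if on_result is not None:
        on_result(result)
    return result


def audit_read_only(tool, backend, *args, **kwargs):
    """Run a tool and return any mutations it caused on the backend."""
    before = len(backend.mutations)
    tool(backend, *args, **kwargs)
    return backend.mutations[before:]


backend = RecordingBackend()

# Plain lookup: no side effects expected.
clean = audit_read_only(lookup_tool, backend, "acct-1")

# Same tool with a callback handler attached: the "read" now writes.
leaky = audit_read_only(
    lookup_tool, backend, "acct-1",
    on_result=lambda r: backend.write("audit", r),
)

print(clean, leaky)
```

Running the same tool with and without its attached handlers makes the hidden write visible, which is the kind of failure that isolated prompt tests tend to miss.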

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Agentic AI Top 10 and CSA MAESTRO address the attack and risk surface, while NIST AI RMF sets the governance and control requirements practitioners need to meet.

Framework | Control / Reference | Relevance
OWASP Agentic AI Top 10 | A3 | Tests must cover prompt injection, tool abuse, and runtime escalation.
CSA MAESTRO | TRM | Focuses threat modeling on agent autonomy, tools, and data flows.
NIST AI RMF | | AI RMF applies risk-based testing and ongoing monitoring for AI systems.

Use AI RMF to define testing scope, measure harms, and reassess controls after each change.