Who is accountable when an LLM leaks data after following malicious instructions?

Why This Matters for Security Teams

When an LLM leaks data after following malicious instructions, the failure is usually not “the model got hacked” in isolation. The real issue is that an organisation gave the model access to sensitive context, tool execution, or downstream systems without constraining what the model could do with untrusted input. That makes accountability a governance and control-design problem, not a blame exercise aimed at the model itself.

This is why current guidance increasingly treats agentic and LLM risk as an identity and authorisation problem. The OWASP Agentic AI Top 10 and the NIST AI Risk Management Framework both push organisations toward clearer ownership, better input isolation, and runtime controls instead of assuming the model will self-separate trusted from untrusted instructions.

NHIMG research on OWASP NHI Top 10 shows why this matters operationally: once an identity can call tools, query data, or move across workflows, a single prompt injection can become a broader access event. In practice, many security teams encounter accountability gaps only after sensitive data has already been exposed, rather than through intentional access governance.

How It Works in Practice

Accountability should follow control over the system, not the textual output alone. In most real deployments, the organisation that approved the use case, connected the data sources, configured the tools, and set the access policy owns the risk when the LLM acts on malicious instructions. That includes application owners, IAM teams, security engineering, and the business function that accepted the workflow.

The practical control model is to separate instruction channels, restrict tool scope, and make every sensitive action a runtime decision. That means using workload identity for the agent or model-backed service, short-lived credentials, and policy checks at the moment of access. The emerging direction is context-aware authorisation, where a request is allowed only if the task, data sensitivity, caller identity, and destination tool all match policy.

Use a distinct identity for the LLM service or agent, not shared human credentials.

Issue just-in-time secrets or tokens for narrowly defined tasks, then revoke them quickly.

Evaluate prompts, retrieved content, and tool calls separately so untrusted text cannot inherit trust.

Log tool use, data access, and policy decisions so attribution is possible after an incident.

NHIMG’s AI LLM hijack breach coverage and the 52 NHI Breaches Analysis both reinforce a common pattern: once an LLM can reach secrets, databases, or SaaS APIs, malicious instructions often turn into credential exposure or unauthorised data movement. This aligns with the Anthropic AI-orchestrated cyber espionage report, which shows that autonomous systems can chain actions in ways humans do not predict. These controls tend to break down when an LLM is embedded inside a legacy app that still treats every internal call as trusted because the application cannot distinguish user intent from model intent at runtime.

Common Variations and Edge Cases

Tighter control often increases integration overhead, requiring organisations to balance faster agent workflows against stronger containment and auditability. There is no universal standard for this yet, so the right answer depends on whether the system is a chat assistant, a retrieval pipeline, or an autonomous agent with tool execution.

One common edge case is shared platform ownership. If one team runs the model API while another team owns the data source, accountability is split unless roles are explicitly documented. Another is vendor-hosted LLMs, where the enterprise still owns the access decisions it made, even if the model is external. Best practice is evolving toward clear control mapping rather than assuming the vendor absorbs liability for misrouted data.

For agentic systems, the distinction between “the model leaked data” and “the workflow allowed data leakage” matters. If the prompt included malicious instructions, the question becomes whether input filtering, tool isolation, least privilege, and approval gates were designed to stop that path. Where retrieval-augmented generation, multi-agent orchestration, or autonomous code execution is involved, the accountability surface expands to include orchestration logic and downstream identity bindings. This is why the CSA MAESTRO agentic AI threat modeling framework and NIST guidance are increasingly used to assign ownership across the full execution chain, not just the model endpoint.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Agentic AI Top 10 and CSA MAESTRO address the attack and risk surface, while NIST AI RMF set the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
OWASP Agentic AI Top 10	A2	Prompt injection and tool abuse drive this leakage scenario.
CSA MAESTRO	T1	Maps accountability across agent orchestration and connected systems.
NIST AI RMF	GOVERN	Defines accountability, oversight, and risk ownership for AI systems.

Document accountable owners, approval gates, and incident responsibilities for AI use.

#1 Authority in NHI Education, Research and Advisory, empowering organizations to tackle the critical risks posed by Non-Human Identities (NHIs), including AI Agents.

Who is accountable when an LLM leaks data after following malicious instructions?

Why This Matters for Security Teams

How It Works in Practice

Common Variations and Edge Cases

Standards & Framework Alignment

Related resources from NHI Mgmt Group