Anthropic’s shared responsibility model reframes AI agent governance

By NHI Mgmt Group Editorial Team | Published 2026-04-29 | Domain: Agentic AI & NHIs | Source: Backslash Security

TL;DR: Anthropic’s paper breaks AI agent security into four layers (model, harness, tools, and environment) and reports that 93% of permission prompts are approved without reading, while on complex tasks Claude asks for clarification on only 16.4% of turns. The governance gap is no longer theoretical: organisations must treat agent permissions, tool drift, and deployment context as persistent NHI controls, not user prompts.

At a glance

What this is: Anthropic’s framework maps AI agent security to four responsibility layers and shows why human approval alone is failing at production scale.

Why it matters: IAM and NHI teams need a control model for autonomous agents because access, tools, and environment now determine blast radius more than model choice.

By the numbers:

Anthropic’s data shows 93% of permission prompts are approved without reading.
On complex tasks, Claude asks for clarification on 16.4% of turns.

Context

AI agent governance is now an access-control problem, not just a model-safety problem. Once an agent can hold state, call tools, and operate inside production systems, the security question becomes who controls its permissions, its instructions, and its operating context.

Backslash Security’s article uses Anthropic’s shared-responsibility framing to argue that existing standards do not fully cover agent-caused harm within authorised access. That makes the issue directly relevant to NHI governance, because agents behave like non-human identities with execution authority and tool reach.

The starting point in the article is typical of the market: teams are still treating AI agent oversight as a human approval exercise, even though that model breaks down quickly under volume. The more useful frame is lifecycle governance across the whole agent estate, from onboarding to tool change control to offboarding.

Key questions

Q: How should security teams govern AI agents as non-human identities?

A: Security teams should govern AI agents as persistent non-human identities with scoped authority, not as transient prompts. That means assigning ownership for model, harness, tools, and environment, then enforcing least privilege, change control, and runtime monitoring across each layer. Approval alone is not enough when agents can act repeatedly at machine speed.

Q: When does human approval become ineffective for AI agent security?

A: Human approval becomes ineffective when volume, speed, or ambiguity causes reviewers to stop reading before approving. At that point the control is ceremonial, not operational. Organisations should treat approval as exception handling and move routine protection into policy, task scoping, and automated enforcement that does not depend on attention span.

Q: What is the difference between controlling an AI model and controlling an AI agent?

A: Controlling a model focuses on what the system says or refuses. Controlling an agent also covers what it can do, what tools it can invoke, what memory it retains, and what systems it can reach. For security teams, the agent problem is an identity and access problem, not just a content-safety problem.

Q: Why do AI agents complicate zero trust architecture?

A: AI agents complicate zero trust because they can hold state, reuse credentials, and make tool calls across multiple systems without a fresh human decision each time. Zero trust still applies, but it must be enforced continuously at the harness, tool, and environment layers. Otherwise the agent becomes a high-speed trust multiplier.
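
A minimal sketch of what continuous enforcement can look like at the tool layer, assuming a hypothetical `Grant` record and `authorize_call` helper (neither comes from Anthropic's paper or the Backslash article). The idea is that authority is re-checked on every tool call and expires on its own, rather than being established once at login.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Grant:
    """Hypothetical short-lived grant: one agent, one tool, one expiry."""
    agent_id: str
    tool: str
    expires_at: datetime


def authorize_call(grants, agent_id, tool, now):
    """Re-check authority on every call; a missing or expired grant fails closed."""
    return any(
        g.agent_id == agent_id and g.tool == tool and now < g.expires_at
        for g in grants
    )


issued = datetime(2026, 4, 29, 12, 0, tzinfo=timezone.utc)
grants = [Grant("support-agent", "read_ticket", issued + timedelta(minutes=15))]

# Same agent, same tool: allowed inside the window, refused after it.
within = authorize_call(grants, "support-agent", "read_ticket", issued + timedelta(minutes=5))
expired = authorize_call(grants, "support-agent", "read_ticket", issued + timedelta(minutes=20))
# A tool that was never granted is refused regardless of time.
ungranted = authorize_call(grants, "support-agent", "issue_refund", issued + timedelta(minutes=5))
print(within, expired, ungranted)  # True False False
```

The 15-minute window is an arbitrary illustration; the point is that a held credential stops working without anyone having to revoke it.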

Technical breakdown

How the four-layer AI agent security model works

Anthropic’s framework separates AI agent security into Model, Harness, Tools, and Environment. The model layer covers how the system reasons and refuses harmful requests. The harness layer is the policy and instruction wrapper, including system prompts and approval logic. Tools include MCP servers, APIs, and plugins. Environment covers where the agent runs and what it can reach. The important architectural point is that only one layer sits with the model provider. The other layers are owned by the deploying organisation, which is why security failures often appear as configuration drift rather than model failure.

Practical implication: Inventory each agent against the four layers before you try to govern it.
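
One way to make that inventory concrete is a small record per agent. The sketch below is illustrative only: the class names, field names, and example agent are assumptions, not part of Anthropic's framework. It flags any of the four layers that is undocumented or has no named owner.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

# The four responsibility layers named in the framework.
LAYERS = ("model", "harness", "tools", "environment")


@dataclass
class LayerRecord:
    description: str                      # e.g. model name, policy file, connector list
    owner: str                            # accountable team; "" means unassigned
    last_reviewed: Optional[date] = None  # None means never reviewed


@dataclass
class AgentInventoryEntry:
    agent_id: str
    layers: dict = field(default_factory=dict)  # layer name -> LayerRecord

    def gaps(self):
        """Return layers that are undocumented or have no named owner."""
        return [
            name for name in LAYERS
            if name not in self.layers or not self.layers[name].owner
        ]


# Hypothetical agent: tools have no owner, environment is not documented at all.
entry = AgentInventoryEntry(
    agent_id="invoice-triage-agent",
    layers={
        "model": LayerRecord("hosted LLM", owner="model provider"),
        "harness": LayerRecord("system prompt and approval rules",
                               owner="platform-team",
                               last_reviewed=date(2026, 4, 1)),
        "tools": LayerRecord("MCP connectors: ticketing, billing", owner=""),
    },
)
print(entry.gaps())  # ['tools', 'environment']
```

An agent with a non-empty `gaps()` result has no control baseline yet, so it is a candidate for review before any policy tuning.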

Why MCP server changes create hidden NHI exposure

MCP servers extend an agent’s effective authority by exposing tools and data sources. That makes them part of the NHI attack surface because an approved connector can change after initial review, adding capabilities or altering behaviour without a new access decision. In practice, this is a supply-chain problem for agent tooling: the trust decision happens once, but the risk can change many times. The main failure mode is not just compromise. It is stale authorisation applied to a moving target, which is a familiar IAM problem with a new execution layer.

Practical implication: Treat tool approvals as time-bound and revalidate them whenever capabilities or prompts change.
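
A hedged sketch of time-bound, drift-aware tool approval follows. It assumes each tool is described by a dictionary with `name`, `description`, and `inputSchema` keys (loosely modelled on what an MCP server lists for its tools). The function names and the 30-day window are hypothetical choices for illustration.

```python
import hashlib
import json
from datetime import datetime, timedelta, timezone


def manifest_fingerprint(tools):
    """Hash the tool definitions in a stable order so any change alters the digest."""
    canonical = json.dumps(sorted(tools, key=lambda t: t["name"]), sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def approval_is_valid(approved_fp, approved_at, current_tools, now,
                      max_age=timedelta(days=30)):
    """An approval lapses when it is too old or the manifest no longer matches."""
    if now - approved_at > max_age:
        return (False, "approval expired")
    if manifest_fingerprint(current_tools) != approved_fp:
        return (False, "tool manifest changed since approval")
    return (True, "ok")


# Manifest as it looked at review time (hypothetical tool).
reviewed = [{"name": "read_ticket",
             "description": "Read a ticket by id",
             "inputSchema": {"type": "object"}}]
approved_fp = manifest_fingerprint(reviewed)
approved_at = datetime(2026, 4, 1, tzinfo=timezone.utc)
now = datetime(2026, 4, 15, tzinfo=timezone.utc)

# The same server later adds a destructive capability without a new review.
drifted = reviewed + [{"name": "delete_ticket",
                       "description": "Delete a ticket by id",
                       "inputSchema": {"type": "object"}}]

unchanged_result = approval_is_valid(approved_fp, approved_at, reviewed, now)
drifted_result = approval_is_valid(approved_fp, approved_at, drifted, now)
print(unchanged_result)  # (True, 'ok')
print(drifted_result)    # (False, 'tool manifest changed since approval')
```

Because descriptions are part of the hash, a reworded tool description also invalidates the approval, which matters when descriptions are fed to the model as instructions.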

Why human-in-the-loop approval breaks down for agents

Human-in-the-loop control assumes people can meaningfully review each action before execution. Anthropic’s data suggests that does not scale. When 93% of prompts are approved without reading, the human is no longer a control. The agent is effectively operating under delegated trust, with the person serving as a backstop after the fact. That shifts the security question from review to policy design. The relevant control is not whether a human saw the action, but whether the agent had the right to attempt it in the first place.

Practical implication: Move from per-action approval to policy-based guardrails and task-scoped delegation.
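
As a rough illustration of task-scoped delegation, the sketch below decides each attempted action against a declared scope and sends only named exceptions to a human. `TaskScope`, `decide`, and the example tool and data-class names are all hypothetical; a real harness would enforce this outside the model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskScope:
    """What an agent may do for one task, declared up front."""
    task: str
    allowed_tools: frozenset
    allowed_data_classes: frozenset
    needs_human_for: frozenset = frozenset()  # exceptions, not the normal path


def decide(scope, tool, data_class):
    """Return 'allow', 'escalate', or 'deny' for one attempted action."""
    if tool not in scope.allowed_tools or data_class not in scope.allowed_data_classes:
        return "deny"       # out of scope: never reaches a human prompt
    if tool in scope.needs_human_for:
        return "escalate"   # in scope but sensitive: human handles the exception
    return "allow"          # routine in-scope action: no prompt at all


scope = TaskScope(
    task="refund-triage",
    allowed_tools=frozenset({"read_ticket", "issue_refund"}),
    allowed_data_classes=frozenset({"support", "billing"}),
    needs_human_for=frozenset({"issue_refund"}),
)

decisions = [
    decide(scope, "read_ticket", "support"),
    decide(scope, "issue_refund", "billing"),
    decide(scope, "deploy_code", "support"),
    decide(scope, "read_ticket", "hr"),
]
print(decisions)  # ['allow', 'escalate', 'deny', 'deny']
```

The design choice is that out-of-scope requests are denied by policy rather than shown to a reviewer, so the few prompts a human does see are the ones worth reading.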

Threat narrative

Attacker objective: The attacker aims to turn legitimate agent privileges into unauthorised access, data exposure, or unsafe actions inside the enterprise stack.

Entry: An attacker can exploit an overly permissive or changed MCP server to reach an agent’s trusted tool path.
Escalation: The agent accepts poisoned memory, ambiguous instructions, or a compromised peer output as valid context.
Impact: The agent performs authorised but harmful actions inside systems the organisation expected to remain protected.

Moltbook AI agent keys breach: 1.5M AI agent keys were exposed.
AI LLM hijack breach: attackers used stolen AWS access keys to hijack Anthropic models on Bedrock.

Read our 52 NHI Breaches Analysis report for a comprehensive view of breaches impacting Non-Human Identities including AI Agents.

NHI Mgmt Group analysis

AI agents should be governed as NHIs with layered responsibility, not as clever user interfaces. The strongest part of Anthropic’s framing is that it names ownership boundaries instead of collapsing all risk into the model. That is the right direction for IAM because agents hold credentials, invoke tools, and operate continuously. Security programmes that still treat them as session-based prompts will miss the real exposure. Practitioners should govern agents as persistent identities with scoped authority.

Ephemeral approval is not the same as least privilege. A human clicking approve on a request does not create durable assurance if the agent can later repeat similar actions at scale. The trust problem is structural: access decisions must be tied to task scope, data sensitivity, and tool reach. That means organisations need policy at the harness and environment layers, not just better UX around approvals. Practitioners should replace ad hoc consent with enforced control boundaries.

Runtime change control is becoming the new identity control plane for agents. Once tools can update, expand, or chain into other agents, the security posture changes after the initial onboarding decision. That means access reviews alone are not enough. Organisations need continuous validation of tool state, memory state, and reachable systems. Practitioners should assume that an approved agent can become risky without any new login event.

Identity blast radius is the concept this market needs. The relevant question is no longer only who or what authenticated, but how far that entity can move once inside. AI agents expand blast radius because they combine credentials, autonomy, and machine speed. That makes NHI governance more about containment than authentication. Practitioners should measure and reduce blast radius before they add more autonomy.

Shared responsibility for agents will push the market toward control-plane tooling. The industry is moving beyond static policy documents toward systems that can inspect tool changes, enforce task limits, and monitor execution continuously. That shift will favour governance models that connect identity, permissions, and runtime telemetry. Practitioners should expect agent security to converge with NHI control-plane design.

From our research:
96% of technology professionals identify AI agents as a growing security threat, and 66% believe this risk is immediate, according to AI Agents: The New Attack Surface report.
Only 44% have implemented policies to govern AI agents, even though 92% agree that governance is critical for enterprise security.
For a deeper control model, see the Ultimate Guide to NHIs: Standards, and map agent controls to the underlying NHI lifecycle.

What this signals

Identity blast radius will become the practical measure that separates mature agent programmes from experimental ones. If an agent can reach customer data, deploy code, and call third-party tools, then a single policy miss can become an enterprise-scale event. The governance task is to reduce reachable harm before autonomy expands further.

With 80% of organisations reporting that their AI agents have already acted beyond intended scope, the operational question shifts from adoption to containment. That is why the control stack must move toward continuous authorisation, not periodic review, and why the OWASP NHI Top 10 remains a useful reference for agentic exposure patterns.

Programme owners should expect agent security to converge with broader identity governance, especially where tooling, memory, and runtime context all change after onboarding. In that environment, continuous monitoring of reachable systems matters more than one-time approval, and lifecycle controls become the difference between manageable autonomy and unmanaged exposure.

For practitioners

Map every agent to its four responsibility layers: Document which model, harness rules, connected tools, and runtime environment apply to each agent. Include the owner for each layer and the last review date for tool changes. This gives you a control baseline before you start tuning policy.
Replace per-action approval with task-scoped policy: Define what an agent may do by job function, data class, and tool set, then enforce those limits continuously. Human approval should become exception handling, not the normal control path.
Revalidate MCP servers after any capability drift: Review tool descriptions, permissions, and outputs whenever an MCP server changes version, adds a function, or expands its data reach. Treat tool drift as a control failure, not routine maintenance.
Track agent blast radius as a formal risk metric: Measure which systems, datasets, and secrets each agent can reach, then rank agents by the maximum harm they can cause if misdirected. Use that ranking to prioritise containment, segmentation, and monitoring.
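
The last item can be approximated with a simple scoring pass. This is a sketch under stated assumptions: the sensitivity weights, the doubling for write access, and the example agents are invented for illustration and would need calibrating against an organisation's own data classification.

```python
from dataclasses import dataclass

# Hypothetical weights: higher means more harm if the resource is misused.
SENSITIVITY = {"public": 1, "internal": 2, "confidential": 4, "secret": 8}


@dataclass
class Reach:
    resource: str      # system, dataset, or secret the agent can touch
    sensitivity: str   # key into SENSITIVITY
    can_write: bool    # write access doubles the weight in this sketch


def blast_radius(reaches):
    """Sum the weighted reach of one agent across everything it can touch."""
    return sum(
        SENSITIVITY[r.sensitivity] * (2 if r.can_write else 1)
        for r in reaches
    )


def rank_agents(agents):
    """Return (agent, score) pairs, highest blast radius first."""
    scored = ((name, blast_radius(reaches)) for name, reaches in agents.items())
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


agents = {
    "docs-bot": [Reach("wiki", "internal", False)],
    "release-agent": [Reach("ci-pipeline", "confidential", True),
                      Reach("signing-keys", "secret", False)],
    "support-agent": [Reach("customer-db", "confidential", False),
                      Reach("ticketing", "internal", True)],
}

ranking = rank_agents(agents)
print(ranking)  # [('release-agent', 16), ('support-agent', 8), ('docs-bot', 2)]
```

A ranking like this is only as good as the reach data behind it, so it pairs naturally with the per-agent layer inventory in the first item.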

Key takeaways

AI agents create an access-governance problem because autonomy, tools, and state expand what an identity can do after authentication.
Anthropic’s own data shows that human review is already failing at scale, which makes approval-only controls too weak for production use.
Security teams should focus on blast radius, tool drift, and task-scoped policy if they want agent governance to hold up operationally.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

The OWASP Agentic AI Top 10 addresses the attack and risk surface, while NIST AI RMF and NIST Zero Trust (SP 800-207) set the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
OWASP Agentic AI Top 10		Agent tool misuse and inter-agent trust failures map directly to this framework.
NIST AI RMF		AI governance and accountability apply to autonomous agent decision-making.
NIST Zero Trust (SP 800-207)	PR.AC-4 (a NIST CSF v1.1 subcategory)	Continuous verification and least privilege are essential for autonomous agent access.

Assess agent tool access, memory, and communication paths against the agentic AI risk model.

Key terms

AI Agent: An AI agent is autonomous software that can decide, act, and use tools on behalf of a task. In security terms, it behaves like a non-human identity because it can hold credentials, access data, and create real-world effects without constant human intervention.
Harness: The harness is the layer of instructions, policies, and approval logic wrapped around an AI agent. It is where organisations try to constrain behaviour, but it only works if the rules are explicit, current, and enforced outside the model itself.
MCP Server: An MCP server is a tool endpoint that connects an AI agent to external systems and data sources through Model Context Protocol. Because it extends what the agent can reach, it becomes part of the identity and access surface and must be reviewed like any other privileged connector.
Identity Blast Radius: Identity blast radius is the maximum damage a non-human identity can cause if it is misused or compromised. For AI agents, it is shaped by tool access, data reach, and runtime autonomy, which is why containment matters as much as authentication.

Deepen your knowledge

AI agent governance, tool control, and lifecycle access are core topics in our NHI Foundation Level course, the industry's only accredited NHI security programme. If your team is building controls for autonomous systems, this is a practical place to start.

This post draws on content published by Backslash Security: Anthropic's shared responsibility security model for AI agents, explained. Read the original.

NHI Mgmt Group — the independent authority on Non-Human Identity, IAM, and Agentic AI security. nhimg.org