AI model recall misses the real control point in agent security

By NHI Mgmt Group Editorial TeamPublished 2026-06-14Domain: Agentic AI & NHIsSource: Pillar Security

TL;DR: A model recall does not address the real security boundary in AI systems, because refusals are statistical and the decisive control sits in the harness and blast radius around the model, according to Pillar Security. The durable fix is to bound what an AI system can reach and do, not to rely on the model to police itself.

At a glance

What this is: This is an opinion analysis arguing that AI model refusals are not a security boundary and that the real control point is the external harness and blast radius.

Why it matters: It matters because IAM, NHI, and AI governance teams must control what an AI system can reach, not just whether it behaves well under prompt pressure.

👉 Read Pillar Security's analysis of AI model refusals, prompt injection, and blast radius

Context

A model refusal is not the same thing as an access control. In practice, a model produces outputs probabilistically, while real security boundaries live in the systems around it that decide what it can call, read, or exfiltrate.

The article argues that AI security programmes are focusing on the wrong variable when they treat model behaviour as the primary control surface. For identity practitioners, the relevant question is the blast radius of the deployed system, including tools, files, and credentials reachable by the model.

That distinction matters across agentic AI, NHI-style service access, and human governance because once a system can act on data and tools, the attack surface is no longer the model alone. The control problem shifts to the permissions and guardrails that sit outside it.

Key questions

Q: How should security teams limit AI system damage when model refusals are unreliable?

A: Security teams should limit the AI system's reachable blast radius first, then enforce policy outside the model for any tool use, data access, or secret handling. If the model can decide and act without an external gate, refusal quality becomes secondary. The safest design treats the model as untrusted decision support, not as the control point.

Q: Why do prompt injections remain dangerous even when a model is well aligned?

A: Prompt injections target the harness, not just the model. They exploit the fact that the system may trust retrieved content, tool output, or embedded instructions enough to act on them. Alignment does not solve that problem because the unsafe decision happens in the surrounding workflow, where privileged actions are still reachable.

Q: What do identity teams get wrong about AI access controls?

A: They often focus on whether the model behaves safely instead of whether the system's permissions are tightly scoped. Identity teams should think in terms of who or what can reach secrets, tools, and files, then reduce those privileges before deployment. That is the same blast-radius logic used in NHI governance.

Q: How do you know if an AI agent is overexposed?

A: An AI agent is overexposed when it can reach resources that are not strictly necessary for its job, especially secrets, execution tools, and internal data stores. A good test is simple: remove a permission and see whether the use case still works. If not, the permission may be justified; if yes, it is likely excess reach.

Technical breakdown

Why model refusals are not a security boundary

A model refusal is a statistical tendency, not an enforced control. That means the model can be induced to produce outputs it was trained to avoid, because the security property is probabilistic rather than deterministic. In security terms, that is very different from an access control, which either permits or blocks an action. The article uses this distinction to show why patching the model does not close the attack surface. For identity teams, the lesson is that policy must be enforced outside the model if the action matters.

Practical implication: do not treat model alignment as a substitute for externally enforced access control.

Prompt injection targets the harness, not the model

Prompt injection works by influencing the instructions and data that the model reads, then steering it toward an attacker-chosen outcome. The model is not being “broken” in the same way a jailbreak breaks refusals; it is being misled through the surrounding system. That makes the harness, tools, and data sources part of the security boundary. Once an AI system can reach internal resources, the issue becomes whether those resources are exposed to untrusted instructions, not whether the model can be persuaded to say no.

Practical implication: separate untrusted input, tool outputs, and privileged actions with hard policy gates outside the model.

Blast radius is the real control variable in AI operations

Blast radius is the set of tools, files, APIs, and credentials an AI system can actually reach once it is running. The article argues this is the variable that still matters, because capability is widely available while reach is something teams can constrain. In identity terms, this is a permissioning problem, not a model-quality problem. Once an AI system can act on secrets or internal systems, even a well-behaved model can create outsized damage if the surrounding controls are weak.

Practical implication: map and reduce the model's reachable resources before expanding its operational use.

NHI Mgmt Group analysis

Model refusals were never the control boundary, and treating them as one is a category error. Refusal behaviour is probabilistic, so it cannot function like an identity policy or an authorization gate. That assumption breaks as soon as the system is allowed to act on tools, files, or credentials. The implication is that AI governance must start with the reachable environment, not with the model's output layer.

Blast radius is the named control concept that should replace model-centric security thinking. The meaningful security question is not whether the model can be coerced, but what it can reach if coerced. That reframes the problem in a way IAM and NHI teams already understand: privilege scope determines damage potential. Practitioners should evaluate every deployed agent by its reachable resources, not by its benchmark behaviour.

Prompt injection and jailbreaks are different failure modes, and conflating them weakens governance. A jailbreak attacks the model's refusals, while prompt injection attacks the harness through data the system accepts. The article is right to separate those layers because different controls fail in each case. Security teams should map controls to the layer actually exposed to untrusted input.

AI systems inherit the old identity lesson that access matters more than intelligence. A smarter model with broad reach is a larger governance problem than a weaker model with tight constraints. That principle applies across NHI, autonomous, and human identity programmes: privilege defines impact. The practitioner conclusion is to govern AI systems as access-bearing actors, not as isolated software features.

Runtime controls outside the model are the only defensible boundary for high-risk AI use. The article's central point is that the model cannot be trusted to police itself, so enforcement has to happen where actions are initiated and resources are touched. This aligns with NIST CSF access and control principles and with Zero Trust thinking. Teams should design policy around action paths, not model reassurance.

From our research:
98% of companies plan to deploy even more AI agents within the next 12 months, despite documented rogue behaviour in 80% of current deployments, according to AI Agents: The New Attack Surface report.
A separate finding in the same report shows that 33% of organisations say their AI agents have already accessed inappropriate or sensitive data beyond intended scope.
For the broader attack-surface view, see OWASP Agentic AI Top 10 for controls that sit outside the model and constrain tool misuse.

What this signals

Blast radius, not model behaviour, should become the unit of AI governance. The operating question for practitioners is which tools, secrets, and data paths an agent can reach if its output is steered. In a programme built on least privilege, that means every AI workflow needs an entitlement review just like any other access-bearing identity.

The stronger signal is that agent deployments are scaling faster than governance maturity. With 98% of companies planning to deploy even more AI agents within 12 months, the control gap will widen unless teams inventory trusted sources, tool permissions, and data egress paths now.

For identity teams, the practical shift is to align AI controls with Zero Trust and NHI patterns rather than model confidence scores. That means external policy enforcement, scoped access, and clear ownership for every agentic action path.

For practitioners

Define the model's reachable blast radius Inventory every API, file store, secret, and execution path the AI system can touch, then remove anything that is not required for the use case. Keep the list tied to the deployed harness, not the model vendor's published capability set.
Enforce action controls outside the model Place deterministic policy checks between the model and any privileged operation so the model cannot directly decide on sensitive access, data movement, or tool invocation. The control must evaluate the call itself, not the model's intent.
Separate untrusted content from trusted instructions Treat retrieved web pages, tool outputs, and user prompts as different trust classes, and block any path where untrusted text can be interpreted as governance instructions. This reduces prompt injection risk in retrieval and agent workflows.
Redesign red teaming around reachable actions Test what an attacker can do if the model accepts malicious instructions, then validate the containment of each resulting action path. Use the exposure model to prioritise controls around secrets, code execution, and external calls.

Key takeaways

AI model refusals are not a security boundary, because they are statistical rather than enforced.
The real control variable is blast radius, meaning the tools, files, and credentials an AI system can actually reach.
Practitioners should govern AI systems with external policy gates and tight entitlements, not with model reassurance alone.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Agentic AI Top 10 address the attack and risk surface, while NIST CSF 2.0 and NIST Zero Trust (SP 800-207) set the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
OWASP Agentic AI Top 10	AGENT-03	The article centres on prompt injection and tool misuse in agentic systems.
NIST CSF 2.0	PR.AC-4	Blast-radius reduction depends on least-privilege access to tools and data.
NIST Zero Trust (SP 800-207)	AC-4	The article argues for controls outside the model at decision points.

Bind agent actions to external policy checks and restrict untrusted instructions from reaching privileged tools.

Key terms

Blast Radius: The set of tools, data, files, and credentials an AI system can reach if it is prompted, steered, or hijacked. In identity security, blast radius is the practical measure of exposure, because it defines how far a mistake or attack can travel before a control stops it.
Prompt Injection: An attack that hides instructions in content the AI system reads, such as retrieved documents, web pages, or tool output. The model may follow those instructions because the surrounding workflow failed to separate untrusted text from trusted control inputs.
Jailbreak: A technique that tries to make a model ignore its own refusal behaviour and produce output it was trained to avoid. It attacks the model layer directly, which is different from attacks that exploit the tools and data around the model.
Harness: The software and policy layer surrounding a model, including tools, connectors, routing, and enforcement points. This layer determines what the model can access and what actions are allowed, which is why it is often the real security boundary.

Deepen your knowledge

NHI governance, agentic AI identity, and machine identity security are core topics in our NHI Foundation Level course, the industry's only accredited NHI security programme. If you are responsible for identity security strategy or NHI governance in your organisation, it is worth exploring.

This post draws on content published by Pillar Security: The Fable Recall Puts the Spotlight in the Wrong Place. Read the original.

NHIMG Editorial Note
Published by the NHIMG editorial team on 2026-06-14.
NHI Mgmt Group — the independent authority on Non-Human Identity, IAM, and Agentic AI security. nhimg.org