Hard boundaries, not soft guardrails, define agentic AI security

By NHI Mgmt Group Editorial TeamPublished 2026-03-13Domain: Agentic AI & NHIsSource: Zenity

TL;DR: Agentic AI browsers and copilots remain vulnerable to prompt injection because probabilistic guardrails cannot reliably separate trusted intent from malicious instructions, according to Zenity’s PerplexedComet analysis. The real security boundary is deterministic enforcement at the code, network, or OS layer, where the model never gets a vote.

At a glance

What this is: This analysis argues that soft guardrails are insufficient for agentic AI and that hard, code-level boundaries are the only reliable prevention layer against prompt injection and data exfiltration.

Why it matters: IAM, PAM, and NHI teams need to treat agentic systems as bounded identities with enforceable capability limits, not as prompt-tuned software that can be supervised after the fact.

👉 Read Zenity's analysis of why soft guardrails fail against agentic AI attacks

Context

Agentic AI security fails when teams assume a model can reliably tell trusted instructions from untrusted input. In practice, prompt-level controls are probabilistic, while the dangerous actions happen at runtime through tool use, file access, and egress paths that existing governance rarely constrains tightly enough.

Zenity’s PerplexedComet analysis shows why this matters for identity programmes: an autonomous browser can merge user intent and attacker input into one execution path. That makes the agent’s effective privileges, reach, and containment boundary the real governance problem, not the prompt text surrounding the action.

Key questions

Q: How should security teams prevent prompt injection in agentic AI systems?

A: Security teams should prevent prompt injection by removing dangerous capabilities at the environment level, not by relying on the model to judge intent correctly. Prompt filters, confirmations, and behavioural guardrails can add friction, but deterministic controls over file access, egress, and tool reach are what stop exfiltration when the content is hostile.

Q: Why do soft guardrails fail in agentic AI security?

A: Soft guardrails fail because they are probabilistic and operate in the same reasoning space as the agent they supervise. An attacker can shape untrusted content so the model treats it as part of the task, which means the control can be bypassed or applied too late to matter. Prevention has to sit outside the model.

Q: What breaks when an agent can reach local files and network egress?

A: What breaks is the assumption that the model can safely decide which actions belong to the task. Once local files and outbound transmission are both reachable, a prompt injection can turn normal execution into data theft. The failure is architectural, because the agent has a privilege path that the model should never have had.

Q: What should teams do when an agentic browser must handle untrusted content?

A: Teams should isolate untrusted content handling from privileged actions and require deterministic barriers before the agent can touch sensitive resources. If the browser can read, interpret, and act on hostile text in the same session, then the trust boundary is too weak for production use.

Technical breakdown

Why prompt injection defeats soft guardrails

Soft guardrails are model-mediated controls, so they depend on the same reasoning surface they are trying to supervise. That makes them vulnerable to indirect prompt injection, where attacker-controlled content is interpreted as task-relevant instruction. In agentic systems, the model can merge benign user intent with malicious guidance into a single plan, which is why the boundary between data and command must be enforced outside the model. A prompt filter can reduce obvious abuse, but it cannot reliably distinguish intent under adversarial conditions. This is a structural limitation, not a tuning problem.

Practical implication: Treat prompt-level filtering as detection and friction, not as the primary prevention control.

Hard boundaries as code-level capability control

Hard boundaries remove a capability rather than trying to persuade the model to avoid it. That means the enforcement point sits in the environment, file system, browser, OS, or network layer, where the LLM cannot override it. In the Comet case, blocking file:// access is effective because it cuts off the path the attacker needed for local file traversal and exfiltration. This is the same security logic as least privilege and attack surface reduction, applied to an agent that can act on its own. The model may still misinterpret content, but it cannot exercise a banned capability if the boundary is deterministic.

Practical implication: Move high-risk actions behind environment-level controls that are impossible for the model to bypass.

Intent collision and delegated browser risk

Intent collision is what happens when an agent combines user-requested action with attacker-injected instructions into one execution sequence. That problem is especially dangerous in delegated browsers and copilots because the agent is operating with the user’s trust and the system’s reach at the same time. Once the chain includes navigation to local resources and outbound transmission, the issue stops being semantic and becomes a privilege boundary failure. The architecture is assuming that the agent can safely decide what counts as part of the task. In adversarial settings, that assumption breaks down quickly.

Practical implication: Restrict delegated browsing, file access, and egress in the same trust domain rather than leaving them coupled by default.

Threat narrative

Attacker objective: The attacker sought to make the agent disclose local files and transmit sensitive data off the machine without triggering a meaningful preventive control.

Entry occurred through a malicious indirect prompt injection embedded in a calendar invitation description, which the agent read as part of normal task intake.
Credential access and data exposure followed when the agent was induced to traverse local file paths and read sensitive files from the user’s machine.
Impact was achieved by exfiltrating the file contents through browser navigation to an attacker-controlled server, turning ordinary action into data theft.

Moltbook AI agent keys breach — Moltbook breach exposed 1.5M AI agent keys.
AI LLM hijack breach — attackers used stolen AWS access keys to hijack Anthropic LLM models on Bedrock.

Read our 52 NHI Breaches Analysis report for a comprehensive view of breaches impacting Non-Human Identities including AI Agents.

NHI Mgmt Group analysis

Soft guardrails are a detection layer, not a security boundary. Probabilistic controls can add friction and visibility, but they cannot reliably prevent an agent from acting on malicious instructions embedded in untrusted content. The PerplexedComet chain shows that when the model is allowed to arbitrate trust inside the prompt, the attacker and the user can be merged into a single execution plan. Practitioners should read that as a boundary failure, not a tuning issue.

Hard boundaries expose the real governance unit: capability, not intent. If an agent can reach local files, network egress, or credentials, then governance depends on the architecture constraining those capabilities outside the model. This is where least privilege becomes operational rather than aspirational. The implication is that security teams must stop treating prompt quality as the control plane and start treating reachable capability as the control plane.

Intent collision is the named concept this incident sharpens for the market. It is the failure mode where benign user intent and attacker-controlled instructions are merged by an agent into one course of action. That concept matters because it explains why model alignment alone cannot separate legitimate work from malicious task framing. The practical conclusion is that agentic AI programmes need enforceable boundaries before they need smarter prompts.

Security programmes that rely on human review after agent execution are structurally behind the attack. Once the agent has already read, combined, and transmitted data, review becomes forensics, not prevention. This is the same lesson identity teams learned from excessive standing privilege: the control must exist before the action, not after the trace. The implication is that identity governance for agentic AI must be built around precluded actions, not post hoc approval.

Zenity’s disclosure confirms that agentic AI security now sits inside mainstream identity governance, not beside it. The question is no longer whether the agent is clever enough to follow instructions, but whether it is fenced tightly enough to make some actions impossible. That is a Zero Trust-style problem applied to autonomous execution, and it belongs in the same control conversations as NHI containment, PAM boundaries, and environment-level enforcement. Practitioners should evaluate agent deployments as privileged identities with constrained blast radius.

From our research:
98% of companies plan to deploy even more AI agents within the next 12 months, despite documented rogue behaviour in 80% of current deployments, according to AI Agents: The New Attack Surface report.
Only 44% of organisations have implemented any policies to govern AI agents, even though 92% say governance is critical to enterprise security.
For a broader governance baseline, see OWASP Agentic Applications Top 10 for the control patterns most teams should pressure-test first.

What this signals

With 98% of companies planning to deploy more AI agents in the next 12 months, the governance gap is no longer theoretical. The practical question is whether deployment teams can enforce hard boundaries before agent reach expands faster than identity controls can be redesigned.

Intent collision: this is the operational failure mode teams should now watch for in agentic browsers, copilots, and code assistants. If a single execution path can combine untrusted content with privileged action, then your programme is already depending on the model to solve a problem that belongs in environment-level enforcement.

The most useful benchmark is not how persuasive an agent sounds, but whether it can still be forced to fail safely when hostile content is present. That aligns directly with the boundary-first logic in the OWASP Agentic Applications Top 10 and with Zero Trust-style containment for high-risk actions.

For practitioners

Enforce deterministic capability blocks Remove high-risk functions such as local file access, clipboard access, and arbitrary egress from agent runtimes at the code or policy layer. Do not rely on prompt instructions or confirmation text to compensate for reachable capabilities.
Separate trusted work from untrusted content paths Route external content, such as calendar descriptions, web pages, and inbox items, through a distinct ingestion path that cannot directly trigger privileged agent actions. Treat content parsing as untrusted input handling, not as task execution.
Map agent privileges as enforceable identity boundaries Inventory which resources each agent can reach, then reduce those permissions to the minimum set needed for the task. Apply the same blast-radius discipline used for service accounts and privileged workloads.
Test for intent collision before production rollout Red-team agents with malicious instructions embedded inside otherwise legitimate user content, then verify whether the system can still separate request from payload. If it cannot, the deployment is not yet safe for sensitive data or endpoints.

Key takeaways

Agentic AI security fails when teams rely on soft guardrails to separate trusted intent from malicious input.
The PerplexedComet case shows that deterministic capability blocks matter more than prompt-level caution when file access and egress are in scope.
Practitioners should govern agents as privileged identities with hard limits on what they can reach, read, and transmit.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Agentic AI Top 10 address the attack and risk surface, while NIST AI RMF and NIST Zero Trust (SP 800-207) set the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
OWASP Agentic AI Top 10	NHI-01	Prompt injection and tool misuse are central to this agentic browser case.
NIST AI RMF	GV.1	Agent governance and accountability are the core issue in this disclosure.
NIST Zero Trust (SP 800-207)	PR.AC-4	Least-privilege containment is the practical defense against agent exfiltration.

Remove high-risk tool paths from agents and enforce deterministic execution boundaries.

Key terms

Hard Boundary: A hard boundary is a deterministic control that removes or blocks a capability at the environment, code, or system layer. It does not depend on the model making the right judgment, so it remains effective even when the agent is manipulated by hostile input.
Soft Guardrail: A soft guardrail is a probabilistic control that tries to detect, discourage, or shape unsafe agent behaviour through prompts, policies, or behavioural checks. It can add friction and visibility, but it cannot be relied on as the sole prevention layer in adversarial conditions.
Intent Collision: Intent collision occurs when an agent merges legitimate user intent with attacker-controlled instructions into one execution plan. The resulting action sequence may look coherent to the model, but it is actually the product of mixed trust sources and therefore unsafe to treat as a valid task.
Deterministic Capability Control: Deterministic capability control means limiting what an agent can do through code or system policy, not through model persuasion. In agentic AI, this is the difference between asking for safe behaviour and making unsafe behaviour impossible within the runtime boundary.

Deepen your knowledge

Agentic AI hard boundaries are a core topic in our NHI Foundation Level course, the industry's only accredited NHI security programme. If your team is defining agent controls for browsers, copilots, or tool-using assistants, that governance lens is worth applying early.

This post draws on content published by Zenity: Why Soft Guardrails Get Us Hacked: The Case for Hard Boundaries in Agentic AI. Read the original.

NHIMG Editorial Note
Published by the NHIMG editorial team on 2026-03-13.
NHI Mgmt Group — the independent authority on Non-Human Identity, IAM, and Agentic AI security. nhimg.org