Why do AI agent sandboxes still leak secrets even when egress policies are enforced?

Why This Matters for Security Teams

Sandboxing and egress filtering are necessary, but they do not solve the core problem: an AI agent can turn permitted tools into an exfiltration path without ever violating a network rule. If the workload can read repositories, install packages, inspect logs, or call internal APIs, the sandbox may still be doing exactly what it was told while the agent is doing something the operator did not intend. That is why this issue shows up so often in agentic systems covered in the OWASP NHI Top 10 and the NIST AI Risk Management Framework: the failure is not only perimeter escape, but intent ambiguity.

Current guidance suggests that the real control boundary for autonomous workloads is not just “can it reach the internet,” but “what task is it authorized to complete right now, with which data, and under what policy context.” That is why NHI governance for agents needs to account for tool chaining, latent retrieval, and prompt-driven behaviour, not only outbound destinations. In practice, many security teams encounter secret leakage only after an agent has already copied credentials into a build artifact, package manifest, or chat transcript, rather than through intentional review of the sandbox design.

How It Works in Practice

For AI agents, the operational model has to shift from static access to runtime authorization. Egress policy is still useful, but it should be treated as a last-line constraint, not the main safeguard. The stronger pattern is to bind the agent to a workload identity, issue short-lived credentials per task, and evaluate policy at request time based on the current action, data source, and destination. This is where emerging practice aligns with the CSA MAESTRO agentic AI threat modeling framework and the OWASP Agentic AI Top 10, both of which emphasize that an autonomous system can combine individually legitimate actions into a harmful sequence.
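As a minimal sketch of request-time evaluation, the Python below checks each tool call against a per-task grant covering the action, data source, and destination. The class names, field names, and the example identifiers (such as `repo:infra-secrets`) are hypothetical and chosen only for illustration; a real deployment would more likely express this in a policy-as-code engine.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TaskGrant:
    """What a single task is authorized to do (hypothetical schema)."""
    task_id: str
    allowed_tools: frozenset[str]
    allowed_sources: frozenset[str]
    allowed_destinations: frozenset[str]


@dataclass(frozen=True)
class ToolRequest:
    """One tool call the agent wants to make right now."""
    task_id: str
    tool: str
    data_source: str
    destination: str


def authorize(grant: TaskGrant, request: ToolRequest) -> tuple[bool, str]:
    """Deny by default: every dimension of the request must match the grant."""
    if request.task_id != grant.task_id:
        return False, "request is not bound to this task"
    if request.tool not in grant.allowed_tools:
        return False, f"tool {request.tool!r} not granted for this task"
    if request.data_source not in grant.allowed_sources:
        return False, f"data source {request.data_source!r} not granted for this task"
    if request.destination not in grant.allowed_destinations:
        return False, f"destination {request.destination!r} not granted for this task"
    return True, "allowed"


# Illustrative grant for a narrow coding task.
grant = TaskGrant(
    task_id="fix-lint-142",
    allowed_tools=frozenset({"git_read", "run_tests"}),
    allowed_sources=frozenset({"repo:app"}),
    allowed_destinations=frozenset({"ci:results"}),
)
ok_request = ToolRequest("fix-lint-142", "git_read", "repo:app", "ci:results")
bad_request = ToolRequest("fix-lint-142", "git_read", "repo:infra-secrets", "ci:results")

print(authorize(grant, ok_request))
print(authorize(grant, bad_request))
```

The second request would pass a pure egress filter, since the destination is approved, but it is denied here because the task was never granted that data source.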

In practical terms, teams should assume that an agent can misuse any tool it is granted. A package install step can fetch attacker-controlled content. A source-control action can expose tokens in configuration history. A retrieval step can surface secrets from documents the operator did not expect to be high risk. The question is therefore not only whether traffic is allowed, but whether the task itself should be allowed to touch sensitive objects at all. NHIMG has repeatedly shown that secret exposure is often chained across normal workflows, as seen in the Guide to the Secret Sprawl Challenge and the Reviewdog GitHub Action supply chain attack.

Use JIT credentials with strict TTLs so the agent only has secrets while a task is active.

Prefer workload identity and policy-as-code over long-lived static tokens.

Log tool calls and data access separately from network egress.

Deny access to secret-bearing repositories, artifacts, and chat channels unless the task explicitly requires them.

These controls tend to break down in highly permissive developer sandboxes where agents can install arbitrary dependencies and reuse cached credentials across sessions. In that kind of environment, every approved action is a potential exfiltration step.
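The JIT credential control listed above can be sketched as a small in-memory broker that issues a token per task, rejects it after a strict TTL, and revokes it when the task ends. This is an illustration under stated assumptions, not a production design: `CredentialBroker`, its method names, and the 300-second default TTL are all hypothetical, and a real system would delegate issuance to a secrets manager or workload identity provider.

```python
import secrets
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskCredential:
    task_id: str
    token: str
    expires_at: float  # value of the broker's clock at which the token expires


class CredentialBroker:
    """Issues short-lived, task-bound tokens (hypothetical, in-memory only)."""

    def __init__(self, ttl_seconds: float = 300.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._active: dict[str, TaskCredential] = {}

    def issue(self, task_id: str) -> TaskCredential:
        cred = TaskCredential(
            task_id=task_id,
            token=secrets.token_urlsafe(32),
            expires_at=self._clock() + self._ttl,
        )
        self._active[cred.token] = cred
        return cred

    def is_valid(self, token: str, task_id: str) -> bool:
        cred = self._active.get(token)
        if cred is None or cred.task_id != task_id:
            return False  # unknown token, or token reused by a different task
        if self._clock() >= cred.expires_at:
            del self._active[token]  # expired: drop it so it cannot be reused
            return False
        return True

    def revoke_task(self, task_id: str) -> None:
        """Call when the task finishes so nothing stays cached across sessions."""
        for token in [t for t, c in self._active.items() if c.task_id == task_id]:
            del self._active[token]
```

Binding each token to a task identifier, rather than to the sandbox or the agent, is what prevents a later session from reusing a credential cached by an earlier one.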

Common Variations and Edge Cases

Tighter sandboxing often increases friction, requiring organisations to balance agent autonomy against developer productivity and task completion speed. That tradeoff is real, especially in research, coding, and incident-response workflows where the agent legitimately needs broad access for short periods. Best practice is evolving, and there is no universal standard for how much autonomy should be pre-approved versus dynamically granted.

One common edge case is internal-only leakage. Even if internet egress is blocked, an agent can still leak secrets into build logs, ticketing systems, code comments, or other internal services that later sync outward. Another is cross-tool escalation, where a harmless-seeming command chain becomes dangerous once the agent has access to package managers, SCM hooks, or retrieval plugins. This is why the NIST AI Risk Management Framework and the MITRE ATLAS adversarial AI threat matrix are useful complements: they force teams to reason about misuse paths, not just network paths.
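One way to reduce internal-only leakage is to scan agent output before it is written to an internal sink such as a build log or ticket. The sketch below is a rough illustration of that idea: the three patterns cover a few widely documented token shapes and are nowhere near exhaustive, and the function name `redact` and the placeholder format are assumptions made for this example.

```python
import re

# Illustrative patterns only; real scanners use far larger rule sets.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "github_pat": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    "private_key_header": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}


def redact(text: str) -> tuple[str, list[str]]:
    """Replace secret-like substrings and report which pattern names matched."""
    findings = []
    for name, pattern in SECRET_PATTERNS.items():
        text, count = pattern.subn(f"[REDACTED:{name}]", text)
        if count:
            findings.append(name)
    return text, findings


# Example: a line an agent is about to write to a build log.
log_line = "export AWS_KEY=AKIAIOSFODNN7EXAMPLE"
safe_line, findings = redact(log_line)
print(safe_line)
print(findings)
```

A check like this only catches secrets with a recognizable shape, so it complements, rather than replaces, denying the task access to the secret in the first place.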

NHIMG’s research also shows how confidence can lag behind reality. In The State of Secrets in AppSec, the average estimated time to remediate a leaked secret is 27 days, which means a single sandbox failure can persist long after the initial event. The practical takeaway is simple: if an agent can read it, retrieve it, or format it, assume it can leak it unless policy blocks the action at the moment of use.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Agentic AI Top 10 and CSA MAESTRO address the attack and risk surface, while NIST AI RMF sets the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
OWASP Agentic AI Top 10	A2	Agent tool chaining can turn allowed actions into secret exfiltration.
CSA MAESTRO	T1	MAESTRO focuses on agentic threat paths beyond simple network egress.
NIST AI RMF	GOVERN	AI RMF governance is needed for intent-based authorization and accountability.

Define ownership, policy, and review for every agent task that can touch secrets.
