Subscribe to the Non-Human & AI Identity Journal

System Prompt Leakage

System prompt leakage is the exposure of hidden prompt content to users or attackers. The real security problem is usually not the prompt itself, but the secrets, policy logic, and internal architecture details placed inside it. If those details are sensitive, they should live in code or secrets management instead.

Expanded Definition

System prompt leakage occurs when hidden instructions for an AI agent or chatbot are exposed to a user, attacker, or downstream system. In practice, the risk is not limited to the wording of the prompt itself. It becomes a security issue when teams embed secrets, policy exceptions, internal tool names, or architecture details inside the prompt, because those details can reveal how to bypass controls or target connected systems. For agentic systems, this overlaps with broader NHI governance because the model often acts with tool access and execution authority, similar to an Anthropic report on AI-orchestrated cyber espionage findings about prompt-driven abuse and operational misuse. Definitions vary across vendors on whether leakage includes partial instruction disclosure, indirect inference, or full prompt extraction, so governance language should be precise.

The most common misapplication is treating the system prompt as a safe place for long-lived credentials or privileged logic, which occurs when product teams use prompts as a shortcut for configuration management.

Examples and Use Cases

Implementing controls against system prompt leakage often introduces product friction, requiring organisations to balance conversational flexibility against tighter secrecy and audit requirements.

  • A support bot is instructed to reveal only approved troubleshooting steps, but the hidden prompt also contains an internal escalation path that an attacker extracts and uses to probe admin workflows.
  • An agentic workflow stores API keys in the prompt for convenience, which creates a direct link between prompt exposure and secret compromise, a pattern discussed in the Guide to the Secret Sprawl Challenge.
  • A retrieval-augmented assistant includes private policy text in its system instructions, and that text is later echoed by the model when a user uses adversarial prompting to test boundaries, echoing the breach patterns covered in the The 52 NHI breaches Report.
  • An AI agent is granted tool access under a hidden prompt policy, but the actual control should have been enforced in code and access governance, not in prose instructions alone.

In mature environments, prompt content is kept minimal, while secrets live in dedicated vaults and policy enforcement sits in application logic, aligned with the operational guidance in the Ultimate Guide to NHIs — Why NHI Security Matters Now and the same cautionary pattern seen in Anthropic’s report.

Why It Matters in NHI Security

System prompt leakage matters because AI agents increasingly function as NHIs: they authenticate, call tools, process sensitive context, and sometimes make decisions that affect production systems. If a leaked prompt reveals how an agent is authorised, what tools it can reach, or where secrets are stored, the exposure becomes a pathway to lateral movement and privilege abuse. This is especially important because the Ultimate Guide to NHIs — Why NHI Security Matters Now shows that 96% of organisations store secrets outside secrets managers in vulnerable locations including code, config files, and CI/CD tools, making prompt-based leakage part of a wider secret sprawl problem. For teams building agentic systems, the right lesson is to treat prompts as disposable instruction text, not as a control plane or secret store. That aligns with the operational warnings in Guide to the Secret Sprawl Challenge and the breach evidence in 52 NHI Breaches Analysis.

Organisations typically encounter the consequence only after an agent leaks internal instructions during testing or an incident, at which point system prompt leakage becomes operationally unavoidable to address.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Agentic AI Top 10 and OWASP Non-Human Identity Top 10 address the attack and risk surface, while NIST Zero Trust (SP 800-207) set the governance and control requirements practitioners need to meet.

Framework Control / Reference Relevance
OWASP Agentic AI Top 10 LLM-04 Prompt leakage exposes hidden instructions and tool-use policies to adversarial extraction.
OWASP Non-Human Identity Top 10 NHI-02 Secret sprawl and improper secret storage are core risks when prompts contain sensitive data.
NIST Zero Trust (SP 800-207) AC-4 Zero Trust requires policy enforcement outside the model so leaked prompts cannot grant access.

Enforce least privilege at the application boundary and never trust prompt text as control logic.