Subscribe to the Non-Human & AI Identity Journal

Prompt Leaking

Prompt leaking is the extraction of hidden instructions, examples, or system context that shapes an LLM’s behavior. Security teams care because leaked context can reveal guardrails, internal logic, or sensitive data paths that help an attacker refine later abuse or impersonation attempts.

Expanded Definition

Prompt leaking happens when hidden prompt content, system instructions, tool directives, examples, or policy text is exposed through model interaction. In NHI and agentic AI settings, that exposure matters because prompts often encode guardrails, tool scope, or routing logic that an attacker can reuse.

Definitions vary across vendors because some treat prompt leakage as a pure information disclosure issue, while others fold it into broader prompt injection or model extraction risk. The more precise view is operational: if an agent, assistant, or MCP-connected workflow reveals internal instructions that should have remained private, the system has leaked context. That context may not be a credential, but it can still help an adversary impersonate operators, bypass safety constraints, or map the path to secrets and actions. Guidance in the field is still evolving, so teams should avoid assuming that any single content filter fully solves the problem. For a broader NHI context, see Ultimate Guide to NHIs — Why NHI Security Matters Now and the Anthropic — first AI-orchestrated cyber espionage campaign report.

The most common misapplication is treating prompt leakage as harmless text exposure, which occurs when organisations ignore hidden instructions that reveal tool names, access paths, or security rules.

Examples and Use Cases

Implementing prompt leakage controls rigorously often introduces workflow friction, requiring organisations to weigh model transparency and supportability against the risk of exposing operational context.

  • A support chatbot returns its hidden system prompt after a carefully phrased user request, exposing escalation logic and prohibited topics.
  • An AI agent connected to a ticketing platform discloses tool descriptions, helping an attacker learn which actions can be triggered with crafted inputs.
  • A coding assistant echoes internal examples that include API endpoints or secret-handling patterns, creating a path to follow-on abuse.
  • A red team tests whether prompt templates can be extracted from a multi-turn conversation before the attacker moves to impersonation or prompt injection.
  • A security review maps leaked instructions to NHI controls and compares the incident with patterns discussed in The 52 NHI breaches Report and Guide to the Secret Sprawl Challenge.

In practice, many teams also benchmark this risk against the operational patterns described in 52 NHI Breaches Analysis, because leaked context often becomes valuable only when paired with exposed secrets or overbroad permissions.

Why It Matters in NHI Security

Prompt leaking matters because it can turn a seemingly low-severity AI issue into a governance and access problem. Once hidden instructions are exposed, defenders lose control over how the agent is expected to behave, how it reaches tools, and what a malicious user can target next. In NHI environments, that can accelerate secret discovery, privilege abuse, and fraudulent operator imitation.

The business impact is not theoretical. In Ultimate Guide to NHIs — Why NHI Security Matters Now, 79% of organisations reported secrets leaks, and 77% of those incidents caused tangible damage. That matters here because leaked prompt context often points directly at the same weak spots: embedded credentials, vulnerable workflows, and misconfigured agent permissions. When organisations do not know exactly what an AI system reveals, they also struggle to prove what should have stayed hidden. The safest posture is to assume leaked instructions can be operationally useful to an attacker, even if they do not look sensitive on their own.

Organisations typically encounter the real damage only after an agent has exposed instructions, leaked tool details, or helped an attacker stage a second-stage abuse path, at which point prompt leaking becomes operationally unavoidable to address.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Agentic AI Top 10 and OWASP Non-Human Identity Top 10 address the attack and risk surface, while NIST CSF 2.0 set the governance and control requirements practitioners need to meet.

Framework Control / Reference Relevance
OWASP Agentic AI Top 10 LLM-03 Covers prompt injection and output exposure risks in agentic systems.
OWASP Non-Human Identity Top 10 NHI-07 Addresses secret exposure and misuse in non-human identity workflows.
NIST CSF 2.0 PR.DS-5 Protects data integrity and confidentiality against unintended disclosure.

Restrict hidden instructions and test agents for context disclosure before production release.