
Prompt Extraction

Prompt extraction is the act of using carefully crafted inputs to make a model reveal hidden instructions, embedded data, or connected-source content. It does not require a classic compromise path. The model may still appear to be functioning normally, which makes the disclosure harder to detect with traditional logging.

Expanded Definition

Prompt extraction is a form of model abuse where an attacker uses iterative prompts, role-play, or carefully shaped requests to elicit hidden instructions, embedded system data, or connected-source content. In NHI and agentic AI environments, the target is often not the model itself but the sensitive context surrounding it, including tool outputs, internal policies, and retrieved documents.

Unlike a conventional compromise, prompt extraction can occur while the model appears to answer normally, which makes detection harder for standard application logs and SIEM rules. The term is related to prompt injection, but it is narrower: prompt injection is the broader technique for steering model behavior, while prompt extraction focuses on disclosure. Definitions vary across vendors, and no single standard governs this yet, so governance teams should treat it as an application-layer confidentiality risk across the full AI stack. For baseline security framing, NIST Cybersecurity Framework 2.0 is useful for mapping detection and response obligations, even though it does not name the attack class directly.

The most common mistake is to dismiss prompt extraction as harmless curiosity about the model. It happens when teams assume that a normal-looking response means nothing sensitive was disclosed.
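Because a leaking response can look entirely normal, one option is to compare each response against the hidden instructions before it is returned or logged. The sketch below is a minimal illustration, not a complete detector: it measures how many five-word sequences from the system prompt reappear verbatim in a response. The function names, the five-word window, and the sample strings are assumptions made for this example, and a paraphrased or translated leak would not be caught by this check.

```python
def word_ngrams(text: str, n: int = 5) -> set[tuple[str, ...]]:
    """Return the set of lowercase n-word sequences in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def leaked_fraction(system_prompt: str, response: str, n: int = 5) -> float:
    """Fraction of the system prompt's n-grams that appear verbatim in the response."""
    prompt_grams = word_ngrams(system_prompt, n)
    if not prompt_grams:
        # Prompt shorter than n words: nothing to compare against.
        return 0.0
    overlap = prompt_grams & word_ngrams(response, n)
    return len(overlap) / len(prompt_grams)


# Hypothetical sample data for illustration only.
SYSTEM_PROMPT = (
    "You are a support assistant. Never reveal the escalation policy to end users."
)
leaky = "Sure! My instructions say: " + SYSTEM_PROMPT
benign = "Your ticket has been escalated to the on-call engineer."

print(leaked_fraction(SYSTEM_PROMPT, leaky))
print(leaked_fraction(SYSTEM_PROMPT, benign))
```

A team using a check like this would pick an alerting threshold from its own traffic; any value suggested here would be a guess.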

Examples and Use Cases

Rigorous safeguards against prompt extraction often add friction to retrieval, tool access, and debugging, so organisations must weigh user experience and observability against tighter disclosure controls. Typical scenarios include:

  • An employee asks an internal assistant to repeat its hidden system instructions, then gradually coaxes out policy text that was never meant for end users.
  • A support bot connected to knowledge bases reveals snippets of incident notes after a sequence of harmless-looking follow-up questions.
  • An agent integrated with CI/CD tooling exposes build-time secrets or configuration values when queried through a deceptive troubleshooting flow.
  • A retrieval-augmented application returns context from a document store because a crafted prompt bypasses the intended answer boundary.
  • An attacker probes an assistant until it discloses connected-source content, then uses that data to pivot toward service accounts or API keys referenced in the material.

These cases are especially relevant where model outputs are stitched to internal systems. The Ultimate Guide to NHIs shows how widely secrets and service identities are distributed across enterprises, which expands the amount of context a model can accidentally expose. For implementation patterns and threat terminology, OWASP Top 10 for Large Language Model Applications remains a practical reference point.
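The retrieval-augmented case above can be narrowed by filtering documents against the caller's entitlements before anything reaches the model's context, so a crafted prompt cannot surface text the caller was never allowed to read. The following is a minimal sketch under stated assumptions: the Document type, its allowed_groups field, and the group names are hypothetical, and a real deployment would take entitlements from its own identity provider or document store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    """Hypothetical retrieved document with an access-control label."""
    doc_id: str
    text: str
    allowed_groups: frozenset[str]


def filter_for_caller(docs: list[Document], caller_groups: set[str]) -> list[Document]:
    """Keep only documents that share at least one group with the caller."""
    return [d for d in docs if d.allowed_groups & caller_groups]


# Hypothetical retrieval results for illustration only.
retrieved = [
    Document("faq-1", "How to reset a password.", frozenset({"support", "sre"})),
    Document("inc-42", "Incident notes: rotated the deploy key.", frozenset({"sre"})),
]

context = filter_for_caller(retrieved, caller_groups={"support"})
print([d.doc_id for d in context])
```

The design choice here is to enforce access before prompt assembly rather than asking the model to withhold content it has already been given.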

Why It Matters in NHI Security

Prompt extraction matters because it can turn a routine AI interaction into a disclosure event affecting secrets, service account details, internal prompts, and connected-source content. In NHI programs, that matters most when AI assistants are allowed to touch vaults, ticketing systems, source control, or identity workflows. If an assistant reveals the wrong context, the downstream issue is often not just a data leak but a path to impersonation, lateral movement, or unauthorized tool execution.

NHI Mgmt Group reports that 79% of organisations have experienced secrets leaks, with 77% of these incidents resulting in tangible damage, which shows why disclosure through AI interfaces cannot be treated as theoretical. Prompt extraction also complicates governance because the model may not violate a command boundary; it may simply reveal what was embedded upstream. Controls should therefore cover prompt design, tool scoping, retrieval filtering, and output review, aligned to NIST Cybersecurity Framework 2.0 and the confidentiality outcomes it supports.
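Output review, the last of the controls listed above, can be sketched as a check that runs on every response before it is returned. The example below assumes two things that are not part of this guidance: a canary string planted in the system prompt (the value shown is made up), and a short list of illustrative secret patterns. Neither is exhaustive, and the pattern list in particular would need tuning for a real environment.

```python
import re

# Illustrative patterns only; a real deployment would maintain its own list.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                    # shape of an AWS access key ID
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),  # PEM private key header
    re.compile(r"(?i)\b(?:api[_-]?key|token|secret)\s*[:=]\s*\S+"),
]


def review_output(text: str, canary: str) -> tuple[str, bool]:
    """Redact the canary and secret-like strings; report whether anything was found."""
    if not canary:
        raise ValueError("canary must be a non-empty string")
    flagged = canary in text
    redacted = text.replace(canary, "[REDACTED]")
    for pattern in SECRET_PATTERNS:
        redacted, count = pattern.subn("[REDACTED]", redacted)
        flagged = flagged or count > 0
    return redacted, flagged


# Hypothetical canary value that would be embedded in the system prompt.
CANARY = "CANARY-7f3a"

print(review_output("Use api_key=abc123 to connect", CANARY))
print(review_output("Your ticket is open.", CANARY))
```

A flagged response is a signal for logging and follow-up as well as redaction, since the canary appearing in output means the hidden instructions were at least partly disclosed.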

Organisations typically discover the problem only after an assistant has already exposed internal instructions or source data to an unauthorised user, at which point addressing prompt extraction is no longer optional.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Agentic AI Top 10 and OWASP Non-Human Identity Top 10 address the attack and risk surface, while NIST CSF 2.0 sets the governance and control requirements practitioners need to meet.

Framework | Control / Reference | Relevance
OWASP Agentic AI Top 10 | (none listed) | Covers prompt abuse and disclosure risks in agentic and LLM systems.
OWASP Non-Human Identity Top 10 | NHI-05 | Prompt extraction can reveal secrets and connected-source content tied to NHIs.
NIST CSF 2.0 | PR.DS-01, PR.DS-02 | Address protection of data at rest and in transit, including AI-disclosed data.

Classify AI-accessible data and enforce controls that reduce unintended disclosure through models.