Subscribe to the Non-Human & AI Identity Journal

Instruction Hierarchy

Instruction hierarchy is the order of authority a model applies when interpreting context, system prompts, role messages, and user input. When attackers can influence that hierarchy through templates or wrappers, they can steer behaviour without needing to change the model itself.

Expanded Definition

Instruction hierarchy is the order of authority a model uses when resolving competing inputs, typically giving highest priority to system instructions, then developer or role messages, then user content, and finally lower-trust context such as retrieved text or templates. In NHI and agentic AI systems, that ordering becomes a security boundary because tool calls, policy rules, and workflow wrappers can all be used to steer behaviour. The concept is closely related to prompt injection defence, but it is not limited to prompts alone. It also covers orchestration layers, agent memory, and any component that can silently override intended constraints. Industry usage is still evolving, so different vendors describe the same risk as instruction precedence, message authority, or control-plane dominance. The practical point is consistent: if lower-trust content can outrank higher-trust policy, the model can be induced to ignore restrictions or expose secrets. The most common misapplication is treating all context as equally trustworthy, which occurs when application builders concatenate retrieved text, user input, and policy text into a single prompt without preserving authority boundaries.

For a broader governance lens, NIST Cybersecurity Framework 2.0 helps organisations map this kind of control failure to access and resilience outcomes, while NHI Mgmt Group’s Ultimate Guide to NHIs frames the surrounding identity risks that instruction hierarchy can amplify.

Examples and Use Cases

Implementing instruction hierarchy rigorously often introduces workflow friction, requiring organisations to weigh model flexibility against stronger separation between trusted policy and untrusted content.

  • A customer support agent reads a user email that includes hidden instructions to reveal internal account data. The application must ensure the email content never outranks the system policy that blocks disclosure.
  • A code assistant receives retrieved documentation and a developer prompt that sets tool-use limits. The orchestration layer should preserve the developer instruction as higher authority than the retrieved text, even if the document appears authoritative.
  • An NHI rotation agent is asked by a user to delay revocation during an incident. The agent must follow the system-level revocation policy, not the conversational request, because the request is lower-trust input.
  • In retrieval-augmented generation, a malicious knowledge-base entry tries to redirect the model to an external endpoint. Alignment with the NIST Cybersecurity Framework 2.0 supports separating policy enforcement from content retrieval.
  • Security teams reviewing agent behaviour can use the Ultimate Guide to NHIs to connect instruction precedence failures to broader secret exposure and privilege misuse patterns.

Why It Matters in NHI Security

Instruction hierarchy matters because NHI systems often act on behalf of production workloads, secrets stores, CI/CD pipelines, and privileged automation. When the hierarchy is confused, an attacker does not need to break cryptography or steal a token outright. They can instead shape the model’s interpretation of context so that the agent leaks credentials, approves unsafe tool use, or bypasses revocation logic. That makes the issue especially dangerous in environments where agents handle secrets or administrative actions under time pressure. The NHI Mgmt Group reports that 79% of organisations have experienced secrets leaks, and instruction hierarchy failures can be one of the pathways that turns a harmless-looking prompt into a credential incident. Governance teams should treat the hierarchy as a design control, not a prompt-tuning preference, because security depends on which inputs are allowed to override others. Organisations typically encounter the consequence only after an agent has already misrouted data or executed an unsafe tool call, at which point instruction hierarchy becomes operationally unavoidable to address.

From a control perspective, the NIST Cybersecurity Framework 2.0 supports mapping these failures to governance, access, and recovery outcomes, while the NIST Cybersecurity Framework 2.0 reinforces the need to separate trusted policy from untrusted input in agent workflows.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Agentic AI Top 10 and OWASP Non-Human Identity Top 10 address the attack and risk surface, while NIST CSF 2.0 set the governance and control requirements practitioners need to meet.

Framework Control / Reference Relevance
OWASP Agentic AI Top 10 Covers agent prompt injection and instruction conflicts that alter model behaviour.
NIST CSF 2.0 PR.AC-4 Instruction hierarchy failures can bypass intended access and authority boundaries.
OWASP Non-Human Identity Top 10 NHI-02 Instruction steering can expose secrets and weaken NHI control enforcement.

Preserve a strict trusted instruction layer and block lower-trust content from overriding agent policy.