System instructions are persistent policy rules that define how the model should behave across sessions. User prompts are the task requests entered at runtime. Security teams should govern the first as controlled policy and treat the second as untrusted input that can still try to bypass guardrails or steer the model into unsafe actions.
Why This Matters for Security Teams
The distinction sounds simple, but it drives how controls should be designed. System instructions shape the model’s standing behaviour, while user prompts arrive as untrusted runtime input that can be adversarial, malformed, or manipulative. That means policy, safety boundaries, and tool-use restrictions belong in the system layer, not in whatever the user happens to ask. This is especially important when LLMs are connected to tools, data stores, or identity systems.
Security teams often miss the operational impact: prompt content can try to override intent, exfiltrate context, or steer the model into actions that were never approved. In agentic environments, that becomes more than a content-safety issue because the model may have execution authority. Current guidance suggests treating prompts like any other attacker-controlled input and designing system instructions as controlled policy, not as a hidden safety net. The same mindset applies across NHI governance, where Ultimate Guide to NHIs — What are Non-Human Identities frames machine identities as assets that need explicit control boundaries, not implicit trust.
Anthropic’s Anthropic Project Glasswing work reinforces a practical point: model behaviour must be evaluated in context, especially when instructions and tools intersect. In practice, many security teams encounter prompt injection only after a tool call or data leak has already happened, rather than through intentional policy testing.
How It Works in Practice
At runtime, the model receives a layered instruction stack. System instructions set policy intent, define prohibited behaviours, and constrain tool use. Developer instructions usually add application-specific workflow rules. User prompts then request a task, but they should never be treated as trusted policy. The safest pattern is to keep the system layer narrow, explicit, and testable, then validate all user input before it influences retrieval, function calls, or downstream actions.
For AI security teams, that means separating instruction authority from data handling. A prompt can ask the model to “ignore previous instructions,” but that does not change the governing policy if the system layer is properly enforced. The same principle shows up in DeepSeek breach analysis, where exposed secrets and leaked data show how quickly AI-related trust boundaries collapse when inputs, outputs, and credentials are not separated cleanly.
- Keep system instructions static, minimal, and version-controlled.
- Treat user prompts as untrusted input, even when they come from employees.
- Bind tool access to policy checks outside the model, not just natural-language guardrails.
- Log instruction changes, prompt attempts, and tool decisions separately for auditability.
- Test for prompt injection, data exfiltration, and instruction hierarchy conflicts before release.
For agentic systems, this becomes even stricter. If an AI agent can call APIs or act on a workflow, the real control point is not the prompt itself but the authorization layer around tool execution. The CSA MAESTRO agentic AI threat modeling framework and NIST AI risk guidance both point toward runtime policy evaluation, not blind trust in natural-language instructions. These controls tend to break down when legacy apps let prompts flow directly into privileged actions because the model then inherits too much authority from the application.
Common Variations and Edge Cases
Tighter instruction control often increases engineering overhead, requiring organisations to balance safety against speed, flexibility, and product usability. That tradeoff is real, especially in systems that personalise responses or use multiple prompt layers.
One common edge case is prompt chaining, where a benign user request becomes dangerous only after retrieval, summarisation, or tool invocation. Another is multi-tenant environments, where one tenant’s prompt can contaminate shared context if isolation is weak. Best practice is evolving here, and there is no universal standard for how much instruction content should be exposed to users versus hidden behind policy abstraction. What is consistent is the need to keep secrets, credentials, and action permissions outside the prompt path.
Agentic systems raise the stakes further because the model may pursue goals in ways the operator did not anticipate. That is why AI security teams should think in terms of instruction hierarchy plus execution governance: the system tells the model what it may do, the prompt states what the user wants, and the authorization layer decides whether the action is allowed at all. For broader NHI context, the Ultimate Guide to NHIs — What are Non-Human Identities remains useful for understanding why identity, not language, is the real enforcement surface. In practice, teams usually discover this distinction only after a jailbreak attempt, unsafe tool call, or unintended data exposure has already occurred.
Standards & Framework Alignment
This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.
OWASP Agentic AI Top 10 and CSA MAESTRO address the attack and risk surface, while NIST AI RMF set the governance and control requirements practitioners need to meet.
| Framework | Control / Reference | Relevance |
|---|---|---|
| OWASP Agentic AI Top 10 | A2 | Prompt injection and instruction hierarchy are core agentic AI risks. |
| CSA MAESTRO | MAESTRO covers runtime threat modeling for tool-using AI agents. | |
| NIST AI RMF | AI RMF applies governance to model behaviour, misuse, and operational risk. |
Define trusted instruction layers and block user prompts from directly driving privileged agent actions.
Related resources from NHI Mgmt Group
- What is the difference between managed identities and hardcoded secrets for AI agents?
- What is the difference between human identity governance and AI agent governance?
- What is the difference between workload identity and API keys for AI agents?
- What is the difference between governing human access and governing AI agent access?