What breaks when an AI system cannot separate instructions from data?

Why This Matters for Security Teams

When an AI system cannot separate instructions from data, the security model stops being about content classification and becomes a control-plane problem. A retrieved paragraph, ticket, or webpage can be treated as actionable instruction, which means the model can be steered without any UI compromise. That is why classic prompt filters and keyword blocking are incomplete: they inspect text, not authority. NIST Cybersecurity Framework 2.0 is useful here because it pushes teams toward governed, risk-based controls rather than brittle content assumptions.

This failure mode is especially dangerous in systems that ingest untrusted sources at scale, including retrieval-augmented generation, browser-enabled agents, and email triage workflows. In those environments, the model may not need to be tricked into “saying” the wrong thing; it only needs to be convinced that hostile data is a valid instruction. NHIMG’s research on the DeepSeek breach shows how quickly exposure of sensitive material can cascade once model-adjacent trust boundaries fail. In practice, many security teams discover the issue not through deliberate testing but only after a model has already executed an attacker-shaped workflow.

How It Works in Practice

The core technical issue is instruction hierarchy. A secure system needs a clear distinction between system policy, user intent, and untrusted content. When that boundary is weak, the model can mistakenly elevate data into instructions, especially when text is embedded in a retrieved document, tool output, or page rendered by an agent. This is where prompt injection becomes operationally relevant: the attacker is not “bypassing” the model so much as persuading it to misread context.
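As an illustration of that hierarchy, the following is a minimal Python sketch rather than a reference implementation. The names `Authority`, `Segment`, `build_prompt`, and `may_instruct` are hypothetical, as is the `<data>` wrapper format. The sketch assumes every piece of context is tagged with its authority tier before the prompt is assembled.

```python
from dataclasses import dataclass
from enum import IntEnum


class Authority(IntEnum):
    """Hypothetical trust tiers; higher values carry more authority."""
    UNTRUSTED = 0  # retrieved documents, tool output, rendered pages
    USER = 1       # the authenticated user's request
    SYSTEM = 2     # operator-defined policy


@dataclass(frozen=True)
class Segment:
    authority: Authority
    source: str
    text: str


def may_instruct(seg: Segment) -> bool:
    """Only system policy and user intent may direct behaviour."""
    return seg.authority >= Authority.USER


def build_prompt(segments: list[Segment]) -> str:
    """Render context with the highest authority first and untrusted text marked as data."""
    parts = []
    for seg in sorted(segments, key=lambda s: -s.authority):
        if may_instruct(seg):
            parts.append(f"[{seg.authority.name}] {seg.text}")
        else:
            # Assumed wrapper format; the label signals content, not commands.
            parts.append(f"<data source={seg.source!r}>\n{seg.text}\n</data>")
    return "\n\n".join(parts)
```

Labels of this kind only signal intent to the model and do not enforce anything by themselves, which is why the sketch keeps the authority tier as structured metadata that downstream controls can check.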

Practical defenses usually combine multiple layers:

Separate trusted instructions from untrusted data before the model sees either.

Label retrieval results and tool outputs as content, not commands.

Use allowlisted tool calls with explicit purpose checks.

Apply policy at runtime, not only at prompt-design time.

Log model decisions and tool invocations for review and containment.

Logging matters because real-time policy enforcement depends on what the model is trying to do, not just what text it received, and that intent is only visible in recorded decisions and tool invocations. Current guidance suggests pairing content isolation with control frameworks such as the NIST Cybersecurity Framework 2.0 and with research such as the Ultimate Guide to NHIs — Key Research and Survey Results, which reinforces why non-human identities and runtime trust boundaries must be managed together. These controls tend to break down when untrusted text flows directly into tool-using agents, because the agent can chain benign-looking retrievals into harmful actions before human review catches up.
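The layered defenses above, in particular allowlisted tool calls with purpose checks, runtime policy, and logging, might be combined along the following lines. This is a hedged sketch: the tool names, the `ALLOWED_TOOLS` mapping, the `ToolRequest` fields, and the `authorize` function are all hypothetical, and it assumes the orchestration layer can tell which kind of context prompted a tool call.

```python
import logging
from dataclasses import dataclass

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("tool_gate")

# Hypothetical allowlist: tool name -> purposes it may serve.
ALLOWED_TOOLS = {
    "search_tickets": {"triage", "summarise"},
    "send_email": {"triage"},
}


@dataclass(frozen=True)
class ToolRequest:
    tool: str
    purpose: str       # purpose declared for the current task
    triggered_by: str  # assumed values: "system", "user", or "untrusted"


def authorize(req: ToolRequest) -> bool:
    """Decide at runtime whether a proposed tool call may proceed."""
    allowed_purposes = ALLOWED_TOOLS.get(req.tool)
    if allowed_purposes is None:
        decision, reason = False, "tool not allowlisted"
    elif req.purpose not in allowed_purposes:
        decision, reason = False, "purpose not permitted for tool"
    elif req.triggered_by == "untrusted":
        decision, reason = False, "request originated from untrusted content"
    else:
        decision, reason = True, "ok"
    # Every decision is logged, allowed or not, for review and containment.
    log.info(
        "tool=%s purpose=%s origin=%s allowed=%s reason=%s",
        req.tool, req.purpose, req.triggered_by, decision, reason,
    )
    return decision
```

For example, `authorize(ToolRequest("send_email", "triage", "untrusted"))` is refused even though the tool and purpose are allowlisted, because the request traces back to content rather than to the user or the operator.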

Common Variations and Edge Cases

Tighter instruction isolation often increases engineering overhead, requiring organisations to balance safety against latency, retrieval quality, and developer complexity. That tradeoff becomes sharper in environments that blend structured data, free text, and autonomous tool use.

There is no universal standard for this yet, but current guidance suggests giving the following cases particular attention:

The Schneider Electric credentials breach illustrates how adjacent identity and access failures can amplify model exposure.

Documents that contain policy text, legal language, or operational checklists can be mistaken for higher-priority instructions if formatting is not preserved.

Multi-agent systems are more fragile because one agent can ingest poisoned data and pass it downstream as trusted context.

The main edge case is hybrid content, where a single source contains both legitimate instructions and adversarial payloads. In those situations, coarse “safe/unsafe” scanning is usually not enough; teams need provenance tracking, context segmentation, and strict tool permissions. The practical test is whether the system can preserve meaning without granting authority to what it reads. If it cannot, the model is operating with a collapsed trust boundary, not a mild filtering gap.
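One way to picture provenance tracking and context segmentation is taint propagation, sketched below under stated assumptions. `Trust`, `Chunk`, `derive`, `SENSITIVE_TOOLS`, and `tool_permitted` are hypothetical names, and the rule that derived content inherits the lowest trust of its inputs is an illustrative design choice, not an established standard.

```python
from dataclasses import dataclass
from enum import IntEnum


class Trust(IntEnum):
    UNTRUSTED = 0
    TRUSTED = 1


@dataclass(frozen=True)
class Chunk:
    text: str
    origin: str  # where the text came from, kept for provenance
    trust: Trust


def derive(text: str, origin: str, parents: list[Chunk]) -> Chunk:
    """Create a chunk whose trust is the minimum of its parents' trust."""
    trust = min((p.trust for p in parents), default=Trust.UNTRUSTED)
    return Chunk(text, origin, trust)


# Hypothetical set of tools that change state outside the model.
SENSITIVE_TOOLS = {"send_email", "update_ticket"}


def tool_permitted(tool: str, context: list[Chunk]) -> bool:
    """Block state-changing tools if any chunk in context is untrusted."""
    if tool not in SENSITIVE_TOOLS:
        return True
    return all(c.trust is Trust.TRUSTED for c in context)
```

Under this rule, a summary that one agent produces from a poisoned page stays untrusted when it is handed to the next agent, so the content can still be read and reasoned about without gaining the authority to trigger sensitive actions.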

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Agentic AI Top 10 and CSA MAESTRO address the attack and risk surface, while NIST AI RMF sets the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
OWASP Agentic AI Top 10		Prompt injection and instruction/data confusion are core agentic AI failure modes.
CSA MAESTRO		MAESTRO addresses agent trust boundaries, tool use, and runtime policy enforcement.
NIST AI RMF		AI RMF covers governance and risk treatment for instruction-following failures.

Separate trusted instructions from untrusted content and validate tool actions at runtime.
