Subscribe to the Non-Human & AI Identity Journal

What breaks when malicious instructions are embedded in a Claude Code project file?

The trust model breaks because the assistant treats repository context as inherited authorization. A malicious file can cause the agent to justify harmful actions as approved work, even when the same request would be refused in a direct prompt. Teams should treat project instruction files as security-sensitive inputs, not harmless documentation.

Why This Matters for Security Teams

Malicious instructions in a Claude Code project file are dangerous because the agent does not experience the file as “just text.” It treats repository context as part of the work scope, which can convert an attacker-controlled instruction into apparent authority. That shifts the problem from prompt injection to delegated execution, where the agent may justify file access, code changes, or tool use as if they were legitimate tasks.

This is why current guidance on agentic systems emphasizes context-aware control rather than trusting static prompts or repository boundaries. NIST’s NIST Cybersecurity Framework 2.0 supports the broader principle of protecting assets and managing access, but agentic workflows add a sharper edge: the instruction source itself can become the attack path. NHI Management Group’s Analysis of Claude Code Security highlights that these project-scoped instructions should be treated as security-sensitive inputs, not developer convenience files.

In practice, many security teams encounter this only after an agent has already followed a poisoned instruction file and produced unauthorised tool actions that looked like normal productivity work.

How It Works in Practice

The failure mode starts when the agent reads a project file and merges it into its working context as if it were trusted operational guidance. If that file contains hostile instructions, the model may reinterpret user intent, broaden scope, or suppress safeguards because the repository content appears to come from the project owner. The result is not simply “bad advice.” It is an authorisation confusion problem.

In agentic environments, the safer pattern is to separate three things: what the agent can read, what it can decide, and what it can execute. That usually means policy enforcement at request time, not just at setup time. A runtime policy layer can compare the current task, file origin, tool call, and target resource before allowing action. That aligns with the direction of NIST Cybersecurity Framework 2.0, and with current agentic security guidance that treats tool use as a privileged operation.

  • Classify project instruction files as untrusted inputs unless they are explicitly signed and approved.
  • Use just-in-time credential issuance so the agent only receives short-lived access for the current task.
  • Bind tool permissions to workload identity, not to a static “developer assistant” role.
  • Evaluate each tool request against live policy, including file provenance and user intent.
  • Log when an instruction file changes the agent’s plan, because that often signals prompt injection.

This is also where NHI governance matters: long-lived secrets or broad API keys make poisoned instructions far more damaging. NHIMG’s Ultimate Guide to Non-Human Identities notes that 97% of NHIs carry excessive privileges, which is exactly the kind of condition that turns a malicious project file into a high-impact event. These controls tend to break down in environments where the agent can chain multiple tools without per-call policy checks because the initial poisoned instruction then propagates across the full workflow.

Common Variations and Edge Cases

Tighter instruction handling often increases developer friction, requiring organisations to balance safety against speed. Best practice is evolving, and there is no universal standard for how much repository context an agent should trust by default.

One edge case is a legitimate project file that is later modified by a compromised contributor or build pipeline. Another is an agent that reads instructions from multiple sources and silently prefers the most permissive one. A third is environments that rely on broad workspace credentials, where even a small instruction file can trigger large-scale impact. In those cases, the issue is not just malicious text, but the combination of untrusted context and overbroad privilege.

Practitioners should also avoid assuming that human review alone solves the problem. If the agent can act before review, or if the review happens after an automated commit, the malicious instruction has already influenced the outcome. The safer model is layered: provenance checks for project files, scoped credentials, and explicit approval for high-risk tool actions. That aligns with Analysis of Claude Code Security and current agentic guidance from NIST Cybersecurity Framework 2.0.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Agentic AI Top 10 and CSA MAESTRO address the attack and risk surface, while NIST AI RMF set the governance and control requirements practitioners need to meet.

Framework Control / Reference Relevance
OWASP Agentic AI Top 10 A2 Prompt injection in project files is a core agentic attack path.
CSA MAESTRO M1 Covers agent trust boundaries and instruction injection risks.
NIST AI RMF GOVERN Requires accountability for AI system behavior and input provenance.

Treat repository instructions as untrusted input and gate tool use with runtime policy.