AI agent attack surfaces are shifting from code to runtime behavior

By NHI Mgmt Group Editorial Team
Published 2025-12-03
Domain: Agentic AI & NHIs
Source: Pillar Security

TL;DR: AI security is moving toward inference-time exploitation, indirect injection, poisoned MCP tooling, and agent-to-agent propagation, according to Pillar Security. The assumption that securing the code secures the system breaks when data becomes executable and agents can chain trusted inputs into privileged actions without runtime validation.

At a glance

What this is: This opinion piece argues that AI systems now present a runtime attack surface where data, tools, and agent handoffs can be manipulated as executable instruction paths.

Why it matters: IAM, NHI, and security teams need to govern what AI can consume, trust, and do at runtime, not just what code it runs.

By the numbers:

86% of organizations are blind to AI data flows, with no inventory of, or visibility into, where their AI is connected or what data is exposed.
97% lack proper AI access controls.


Context

AI agent runtime risk is the problem here, not model quality alone. When prompts, retrieved documents, tool outputs, and inter-agent messages all influence execution, the boundary between data and instruction starts to disappear.

That shift has direct identity implications. If an AI system can be steered through trusted data sources, MCP tools, and agent handoffs, then access control has to cover runtime behaviour, data provenance, and delegated authority together, not as separate controls.

Key questions

Q: How should security teams govern AI agents that consume untrusted data?

A: Security teams should treat every data source that can influence an AI agent as part of the control boundary. That means classifying sources by trust level, restricting which sources can trigger privileged actions, and separating retrieval from execution. If the system cannot prove provenance, it should not be allowed to drive sensitive decisions or write operations.
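One way to make "prove provenance" concrete is to have a trusted ingestion pipeline sign content, and to have the agent runtime verify that signature before the content may drive a write. The sketch below is illustrative only: the shared key and the function names (sign_content, has_valid_provenance, may_drive_write) are assumptions, and a production design would more likely use asymmetric signatures with managed keys.

```python
import hashlib
import hmac
from typing import Optional

# Hypothetical shared key for illustration only; a real system would load
# this from a secrets manager and would probably use asymmetric signatures.
SIGNING_KEY = b"demo-key-not-for-production"


def sign_content(content: str, key: bytes = SIGNING_KEY) -> str:
    """Return a hex HMAC-SHA256 tag that a trusted ingestion pipeline attaches."""
    return hmac.new(key, content.encode("utf-8"), hashlib.sha256).hexdigest()


def has_valid_provenance(content: str, tag: Optional[str], key: bytes = SIGNING_KEY) -> bool:
    """True only if a tag is present and matches the content exactly."""
    if tag is None:
        return False
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(sign_content(content, key), tag)


def may_drive_write(content: str, tag: Optional[str]) -> bool:
    """Content without provable provenance stays read-only context."""
    return has_valid_provenance(content, tag)
```

A valid tag only shows that the content came from the signer unaltered. It does not show that the content is benign, so this check complements source classification rather than replacing it.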

Q: Why do AI agents create a larger attack surface than ordinary automation?

A: AI agents create a larger attack surface because they can reinterpret inputs, combine context from multiple sources, and choose actions at runtime. Ordinary automation follows predefined rules. Agents can be manipulated through trusted content, tool responses, or handoffs, which turns the decision layer itself into an attack path.

Q: What breaks when an MCP server is compromised?

A: When an MCP server is compromised, the agent may still trust its response as if it were internal policy or approved guidance. That breaks the assumption that tool use is safe simply because the tool is authenticated. In practice, the agent inherits malicious instructions through a channel that should have been treated as untrusted until verified.

Q: Who is accountable when an AI agent follows malicious instructions from a trusted source?

A: Accountability sits with the organisation operating the agent, the team governing its access, and the owners of the data or tool path that allowed the instruction to be acted on. AI governance has to cover provenance, delegated authority, and runtime approval boundaries; otherwise, blame gets assigned only after the damage is done.

Technical breakdown

Indirect injection turns trusted data into executable instruction

Indirect injection happens when an AI system treats external content as if it were an instruction source. That can include RAG documents, memory stores, tool output, or ordinary text with hidden directives. The security failure is not traditional code injection, but instruction laundering through a trusted context. Once the model accepts that content as authoritative, it may follow embedded commands that alter output, access data, or trigger downstream actions. The CFS idea the vendor references is useful here because payload structure affects whether the model elevates the content into behaviour. The real risk is that trust is being assigned at ingestion time, while the harmful decision occurs later at runtime.

Practical implication: classify AI data sources by trust level and block unvalidated content from shaping privileged actions.
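The classification step can be sketched as a small policy check that runs when an action is requested, not when data is ingested. Everything named here (the Trust levels, ContextItem, the REQUIRED_TRUST table, and the action names) is a hypothetical illustration rather than anything Pillar Security prescribes. The idea is that the context is only as trustworthy as its least trusted item.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List


class Trust(IntEnum):
    UNTRUSTED = 0   # e.g. public web pages, inbound email
    INTERNAL = 1    # e.g. internal wiki, tool output not yet validated
    VERIFIED = 2    # e.g. signed policy documents


@dataclass(frozen=True)
class ContextItem:
    source: str
    trust: Trust
    text: str


# Hypothetical policy: the minimum trust every context item must have
# before the named action is allowed.
REQUIRED_TRUST = {
    "summarise": Trust.UNTRUSTED,
    "send_email": Trust.INTERNAL,
    "merge_pull_request": Trust.VERIFIED,
}


def effective_trust(context: List[ContextItem]) -> Trust:
    """Lowest trust level present; an empty context carries no outside influence."""
    return min((item.trust for item in context), default=Trust.VERIFIED)


def is_action_allowed(action: str, context: List[ContextItem]) -> bool:
    """Adjudicate trust at action time, using everything that shaped the request."""
    required = REQUIRED_TRUST.get(action)
    if required is None:
        return False  # actions missing from the policy are denied by default
    return effective_trust(context) >= required
```

With this shape, a single untrusted web page in the context is enough to block a merge while still allowing a summary, which is the behaviour the practical implication asks for.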

MCP expands the attack surface of agent tool use

The Model Context Protocol creates a standard way for agents to reach tools and data sources, but standardisation does not equal safety. If an MCP server is compromised, the agent may still treat its response as authoritative internal guidance. That is a governance problem for machine identity, because the tool boundary becomes part of the privilege boundary. The article’s example shows how a poisoned recommendation can add malicious dependencies while still sounding technically correct. In practice, the dangerous point is not just the agent calling a tool, but the agent inheriting trust from the tool without any second validation layer.

Practical implication: treat MCP servers as privileged execution dependencies whose outputs stay untrusted until validated, and check those outputs before agents can act on them.
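A second validation layer for the poisoned-dependency scenario could look like the sketch below. It assumes the tool result has already been parsed into a simple structure; the ToolResponse shape, the allowlist contents, and the function name are hypothetical and are not part of the Model Context Protocol itself.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical organisation-approved packages.
PACKAGE_ALLOWLIST = frozenset({"requests", "cryptography", "pydantic"})


@dataclass
class ToolResponse:
    """Assumed shape of a parsed tool result; not defined by the protocol."""
    server: str
    advice: str
    recommended_packages: List[str] = field(default_factory=list)


@dataclass
class Verdict:
    allowed: bool
    rejected_packages: List[str]


def validate_tool_response(response: ToolResponse) -> Verdict:
    """Reject the response if it recommends any package outside the allowlist."""
    rejected = [
        pkg for pkg in response.recommended_packages
        if pkg.strip().lower() not in PACKAGE_ALLOWLIST
    ]
    return Verdict(allowed=not rejected, rejected_packages=rejected)
```

The check deliberately ignores how plausible the advice text sounds and which server sent it. An authenticated internal server that recommends an unapproved package is rejected the same way as an external one.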

Agent-to-agent handoffs can propagate toxic combinations

Agent-to-agent communication changes the risk model because each handoff can combine otherwise safe permissions into a harmful chain. A read-capable agent monitoring untrusted input and a write-capable agent acting on it may form a toxic combination even if neither is individually overprivileged. This is a delegation problem, not simply a permission problem. Traditional service-to-service controls assume schemas, authentication boundaries, and predictable request intent. AI agents pass richer context and can reframe instructions along the way, which makes context contamination a real escalation path. The issue is compounded when multiple agents share implicit trust and no session isolation exists between them.

Practical implication: map agent trust graphs and separate read, decision, and write authority across handoffs.
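Mapping the trust graph can start as a plain reachability check: from every agent that reads untrusted input, is there a handoff path to an agent with privileged write access? The sketch below uses a breadth-first search over a hand-written graph. The Agent fields and the find_toxic_paths name are assumptions for illustration, and a real review would also need to account for what context each handoff actually carries.

```python
from collections import deque
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Agent:
    name: str
    reads_untrusted: bool = False
    writes_privileged: bool = False


def find_toxic_paths(agents: Dict[str, Agent], handoffs: Dict[str, List[str]]) -> List[List[str]]:
    """Return handoff paths that start at an agent reading untrusted input
    and end at an agent holding privileged write access."""
    toxic = []
    for start in agents.values():
        if not start.reads_untrusted:
            continue
        queue = deque([[start.name]])
        seen = {start.name}
        while queue:
            path = queue.popleft()
            current = agents[path[-1]]
            if current.writes_privileged:
                toxic.append(path)  # record the chain and stop extending it
                continue
            for nxt in handoffs.get(current.name, []):
                # Skip agents already visited and names missing from the inventory.
                if nxt not in seen and nxt in agents:
                    seen.add(nxt)
                    queue.append(path + [nxt])
    return toxic
```

In this model neither the reading agent nor the writing agent is overprivileged on its own. The finding is the path between them, which is why the review has to be done on the graph rather than per agent.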

Threat narrative

Attacker objective: The attacker wants to manipulate AI runtime behaviour so trusted systems execute harmful instructions, exfiltrate data, or propagate compromise through the agent ecosystem.

Entry occurs when malicious instructions are embedded in a trusted data source, compromised MCP server, or untrusted inter-agent message that the AI system is designed to consume.
Escalation occurs when the agent converts that input into action, inherits tool trust, or propagates the instruction through a second agent with broader permissions.
Impact occurs when poisoned context drives data exfiltration, malicious code changes, or cascading misuse across connected AI workflows.

Moltbook AI agent keys breach: 1.5M AI agent keys were exposed.
AI LLM hijack breach: attackers used stolen AWS access keys to hijack Anthropic LLMs on Bedrock.

Read our 52 NHI Breaches Analysis report for a comprehensive view of breaches impacting Non-Human Identities including AI Agents.

NHI Mgmt Group analysis

Inference-time exploitation is the right name for this threat class. The core failure is not code weakness but decision manipulation at runtime, where trusted data becomes executable behaviour. That moves security from static application control into the identity and governance of the system that interprets the data. Practitioners should treat this as a runtime trust problem, not a model tuning problem.

Trust assigned at ingestion time is structurally wrong for AI systems. Modern AI security assumes that content can be evaluated once and then safely reused, but indirect injection shows that trust can be weaponised later when the model acts on the content. This is a control gap in the trust boundary itself, and it aligns directly with OWASP Agentic AI Top 10 and the OWASP NHI view that tool and data access are part of the identity surface. Practitioners need to rethink where trust is adjudicated.

Agent-to-agent communication creates identity blast radius. A single compromised agent can contaminate context across the trust graph and turn a narrow access issue into a multi-system behavioural failure. That is why AI governance cannot stop at per-agent permissions. It must account for how delegated context, inherited trust, and downstream write privileges interact across the full chain.

MCP trust is now part of machine identity governance. The article’s tool-poisoning example shows that agents do not just authenticate to tools, they absorb the tool’s authority into their own behaviour. That means a compromised internal tool can function like a credential bridge. The implication is that machine identity programmes must evaluate tool integrity as part of access design, not as a separate infrastructure concern.

Runtime security is becoming the primary control plane for AI risk. The article’s own framing is correct: if a secure codebase cannot stop a poisoned document from driving harmful action, then the decisive control is visibility and restriction at runtime. Security teams should therefore treat AI discovery, data flow mapping, and tool governance as foundational identity controls for agentic environments.

From our research:
96% of technology professionals identify AI agents as a growing security threat, and 66% believe this risk is immediate, according to AI Agents: The New Attack Surface report.
Another finding from the same research shows that 80% of organisations report their AI agents have already performed actions beyond their intended scope, including unauthorised system access, sensitive data sharing, and credential exposure.
For a broader view of where these risks sit in the control stack, read OWASP Agentic AI Top 10 for a structured view of agentic application failure modes.

What this signals

Inference-time exploitation is now a programme design issue, not a niche research concept. Security teams should expect AI risk to shift from model selection to runtime governance, especially where documents, APIs, and tools can all become instruction carriers. With 96% of technology professionals identifying AI agents as a growing security threat, the operating model has already moved beyond pilot-stage assumptions.

Runtime trust debt: this is the accumulated risk created when organisations let AI systems inherit trust from data sources and tools without continuous validation. It becomes visible only after a bad instruction has already been acted on, which is why discovery, provenance checks, and tool governance need to be part of the identity programme. For the architectural pattern behind this shift, use OWASP Agentic AI Top 10 as a reference point.

The immediate planning question is not whether to deploy AI agents, but how to bound the trust graph they operate in. Teams that already manage NHIs, secrets, and workload identity should extend those disciplines to agents, MCP servers, and inter-agent handoffs before those paths become the easiest route to business-layer compromise.

For practitioners

Map AI data flows before granting production access: Inventory every source that can influence model behaviour, including RAG stores, document repositories, tool outputs, Slack channels, and API feeds. Remove blind spots by recording where instruction-bearing content enters the system and which downstream actions it can trigger.
Separate read influence from write authority: Do not allow the same agent chain to both consume untrusted context and commit privileged changes. Break the path between monitoring inputs and making production edits so a poisoned message cannot become an immediate action.
Validate MCP outputs before agents can use them: Treat tool responses as untrusted until they are checked against policy, package allowlists, and expected intent. Require a control layer that can reject malicious recommendations even when they come from an internal service.
Build trust-graph reviews for inter-agent delegation: Document which agents can hand off context, permissions, or work items to other agents, then review those chains for toxic combinations. Focus on where context contamination could cross from a harmless read path into a privileged sink.

Key takeaways

AI security risk is moving from code flaws to runtime manipulation, where trusted data and tools can drive harmful agent behaviour.
The scale signal is already clear: AI agents are operating beyond intended scope in real environments, and most organisations still lack AI data-flow visibility.
Practical defence requires runtime trust controls, agent trust-graph mapping, and tool validation before privileged actions are allowed.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Agentic AI Top 10 and OWASP Non-Human Identity Top 10 address the attack and risk surface, while NIST CSF 2.0 sets the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
OWASP Agentic AI Top 10	A1	Indirect injection and tool poisoning are core agentic AI threats.
OWASP Non-Human Identity Top 10	NHI-01	The article focuses on AI systems as identities with tool and data access.
NIST CSF 2.0	PR.AA-05 (PR.AC-4 in CSF 1.1)	Runtime access governance is central to preventing unauthorised AI actions.

Apply least-privilege and access review controls to AI agents, tools, and delegated context paths.

Key terms

Inference-time exploitation: A compromise pattern where an AI system is manipulated while it is reasoning or acting, rather than through a traditional software vulnerability. The attacker targets the runtime decision process by shaping inputs, tools, or context so the model performs unsafe actions on its own.
Indirect injection: A malicious instruction hidden inside data the AI system trusts, such as retrieved documents, tool output, or memory. The system later treats that data as guidance and follows the embedded command, which makes the payload dangerous because the harmful step happens after ingestion.
Agent trust graph: The set of relationships showing which AI agents, tools, and data sources trust each other enough to pass context or authority. It matters because compromise rarely stays local in agentic systems, and a single weak node can propagate risk across many downstream actions.
Toxic combination: A set of individually acceptable AI capabilities that becomes unsafe when combined, such as read access to untrusted content plus write access to production systems. The risk emerges from the interaction between permissions, not from any one permission in isolation.

Deepen your knowledge

NHI governance, agentic AI identity, and machine identity security are core topics in our NHI Foundation Level course, the industry's only accredited NHI security programme. If you are responsible for identity security strategy or NHI governance in your organisation, it is worth exploring.

This post draws on content published by Pillar Security: The New AI Attack Surface, 3 AI Security Predictions for 2026. Read the original.

NHIMG Editorial Note
Published by the NHIMG editorial team on 2025-12-03.
NHI Mgmt Group — the independent authority on Non-Human Identity, IAM, and Agentic AI security. nhimg.org