Why do AI agents create risk even when they detect phishing correctly?

Why This Matters for Security Teams

Correct phishing detection is not the same as safe execution. An AI agent can classify a message, URL, or attachment as malicious and still keep moving if the workflow allows it to continue. That is the core risk: the model’s awareness does not automatically become a security control. In agentic systems, the dangerous step is often the tool call, credential retrieval, or browser action that follows the detection event.

This is why current guidance increasingly separates detection from authorisation. The OWASP Agentic AI Top 10 and NIST AI Risk Management Framework both point practitioners toward runtime controls, context-aware decisioning, and governance over downstream actions. NHIMG research on the LLMjacking threat vector shows how exposed credentials and AI misuse quickly become operational compromise, not just detection failures. The lesson is that autonomous systems require policy enforcement at the moment of action, not trust in model confidence after the fact.

In practice, many security teams discover that an agent can spot the lure and still complete the compromise because the surrounding automation never asked whether it should proceed.

How It Works in Practice

Effective agent security treats phishing detection as one signal in a larger control chain. When an agent encounters a suspicious prompt, email, link, or file, the secure design should not simply log the finding. It should pause execution, evaluate intent, and require a policy decision before any privileged action occurs. That is where intent-based or context-aware authorisation comes in: the system decides at runtime whether this specific agent, for this specific task, in this specific context, may proceed.
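A minimal sketch of such a runtime gate might look like the following. Every name here (`ActionRequest`, `Decision`, `authorise`, `PRIVILEGED_TOOLS`, and the example tool names) is hypothetical and chosen for illustration; a real deployment would more likely delegate the decision to its own policy engine.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    DENY = "deny"
    ESCALATE = "escalate"  # pause execution pending a human or stricter policy


# Assumption: tools this sketch treats as privileged.
PRIVILEGED_TOOLS = {"fetch_credential", "send_email", "browser_submit"}


@dataclass(frozen=True)
class ActionRequest:
    agent_id: str
    task: str
    tool: str
    phishing_detected: bool  # the model's own classification of the input


def authorise(
    request: ActionRequest,
    allowed_tools: dict[tuple[str, str], set[str]],
) -> Decision:
    """Decide whether this agent may use this tool for this task, in this context."""
    permitted = allowed_tools.get((request.agent_id, request.task), set())
    # Anything outside the agent's task-specific allowlist is refused outright.
    if request.tool not in permitted:
        return Decision.DENY
    # A detection does not end the workflow by itself; it blocks privileged steps.
    if request.phishing_detected and request.tool in PRIVILEGED_TOOLS:
        return Decision.ESCALATE
    return Decision.ALLOW
```

The design choice worth noting is that the detection result is an input to the decision rather than the decision itself: the same agent may still read a flagged message, but it cannot send mail or retrieve a credential until something outside the model approves it.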

Practitioners increasingly combine this with just-in-time credential provisioning and workload identity. Instead of giving an agent a long-lived API key or mailbox token, the workflow issues short-lived credentials for the exact task, then revokes them on completion. Workload identity, such as SPIFFE or OIDC-backed identities, helps prove what the agent is and what environment it is running in before secrets are released. This aligns with the operational direction reflected in the NHI Lifecycle Management Guide and the OWASP NHI Top 10, which emphasize lifecycle discipline and runtime control rather than static trust.

Classify the event, then stop execution until policy evaluates the next step.

Issue ephemeral credentials only for the approved task and revoke them immediately after use.

Bind access to workload identity, not to a reusable human-style session.

Log the agent’s intent, the tool requested, and the policy decision for audit and incident response.
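The credential and logging steps above can be sketched as follows. This is an in-memory illustration under stated assumptions, not a real secrets broker: `EphemeralCredential`, `issue`, and `audit_record` are hypothetical names, and the SPIFFE-style ID in the usage note is only an example value.

```python
import json
import secrets
from dataclasses import dataclass, field


@dataclass
class EphemeralCredential:
    workload_id: str   # identity of the workload, not of a human session
    task: str          # the single task this credential was approved for
    expires_at: float  # absolute expiry as a Unix timestamp
    token: str = field(default_factory=lambda: secrets.token_urlsafe(32))
    revoked: bool = False

    def is_valid(self, now: float) -> bool:
        """A credential is usable only if unrevoked and unexpired."""
        return not self.revoked and now < self.expires_at


def issue(workload_id: str, task: str, ttl_seconds: float, now: float) -> EphemeralCredential:
    """Issue a short-lived credential bound to one workload and one task."""
    return EphemeralCredential(workload_id, task, expires_at=now + ttl_seconds)


def audit_record(workload_id: str, intent: str, tool: str, decision: str, now: float) -> str:
    """Serialise intent, requested tool, and policy decision as one JSON line."""
    return json.dumps(
        {
            "ts": now,
            "workload_id": workload_id,
            "intent": intent,
            "tool": tool,
            "decision": decision,
        },
        sort_keys=True,
    )
```

A caller would pass `time.time()` as `now`, issue the credential only after the policy decision, and set `revoked = True` in a `finally` block so that revocation happens even when the task fails. For example, `issue("spiffe://example.org/agent/mail-triage", "summarise_inbox", ttl_seconds=60, now=time.time())` yields a token that stops validating after a minute regardless of what the agent does next.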

These controls tend to break down in loosely coupled agent chains where one agent’s output becomes another agent’s instruction without a policy checkpoint in between.
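One way to picture the missing checkpoint is a chain runner that evaluates policy on every hand-off. The sketch below assumes agents are plain functions that pass a `Message` along; `Message`, `run_chain`, `HandOffBlocked`, and the `may_pass` callback are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Message:
    text: str
    flagged: bool = False  # set when an upstream agent classified the input as phishing


Agent = Callable[[Message], Message]


class HandOffBlocked(Exception):
    """Raised when the checkpoint refuses to pass output to the next agent."""


def run_chain(
    agents: list[Agent],
    message: Message,
    may_pass: Callable[[Message], bool],
) -> Message:
    """Run agents in order, checking policy after each one produces output."""
    for agent in agents:
        message = agent(message)
        # Checkpoint on every hand-off, not only at the entry to the chain.
        if not may_pass(message):
            name = getattr(agent, "__name__", "agent")
            raise HandOffBlocked(f"output of {name} blocked before hand-off")
    return message
```

With a checkpoint such as `lambda m: not m.flagged`, a triage agent that flags a lure stops the chain before any downstream agent treats that output as an instruction.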

Common Variations and Edge Cases

Tighter runtime control often increases friction, so organisations have to balance safety against task latency and operator overhead. That tradeoff is real, especially in high-volume environments where agents process large message queues or execute multi-step browser workflows. Best practice is still evolving, and there is no universal standard for exactly where a phishing classification should trigger a hard stop versus a soft warning.
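Because that boundary is a local policy choice, it helps to make it an explicit, tunable parameter rather than an implicit behaviour. The sketch below is one possible shape: the `respond` function, the `Response` values, and the default thresholds of 0.3 and 0.7 are all assumptions for illustration, not recommended values.

```python
from enum import Enum


class Response(Enum):
    PROCEED = "proceed"
    SOFT_WARNING = "soft_warning"  # continue, but annotate and log the event
    HARD_STOP = "hard_stop"        # halt until policy or a human approves


def respond(
    phishing_score: float,
    privileged: bool,
    warn_at: float = 0.3,  # assumed threshold, tune per environment
    stop_at: float = 0.7,  # assumed threshold, tune per environment
) -> Response:
    """Map a classifier score and the next action's privilege to a response."""
    if not 0.0 <= phishing_score <= 1.0:
        raise ValueError("phishing_score must be between 0 and 1")
    if phishing_score >= stop_at:
        return Response.HARD_STOP
    if phishing_score >= warn_at:
        # Mid-range scores stop privileged actions but only warn on others.
        return Response.HARD_STOP if privileged else Response.SOFT_WARNING
    return Response.PROCEED
```

Tying the stop threshold to the privilege of the next action keeps latency low for read-only steps while still halting before anything that touches credentials or sends data.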

Edge cases matter. A low-risk internal agent may only need constrained read-only access, while a customer-facing support agent might need brief access to ticketing, CRM, and knowledge base tools. If a workflow allows chained tool use, a single missed checkpoint can let a detected phishing lure turn into credential theft, lateral movement, or data exfiltration. This is why the control objective is not “make the model smarter,” but “make the system harder to misuse after the model has already recognised the threat.”

For governance, the CSA MAESTRO agentic AI threat modeling framework and NIST AI Risk Management Framework both support this runtime-first posture. Security teams should treat “detected phishing” as a trigger for stronger verification, not as permission to proceed.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Agentic AI Top 10 and CSA MAESTRO address the attack and risk surface, while NIST AI RMF sets the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
OWASP Agentic AI Top 10	A1	Addresses agent tool abuse after detection, the core failure mode here.
CSA MAESTRO		Covers threat modeling for agent workflows that continue after malicious input is identified.
NIST AI RMF		Supports governance, measurement, and runtime oversight for autonomous AI risk.

Use AI RMF governance to enforce stop conditions, accountability, and auditability at runtime.
