TL;DR: Attackers rapidly adapted to early agent capabilities, with system-prompt extraction, subtle safety bypasses, exploratory probing, and indirect attacks through untrusted external content emerging as the dominant patterns, according to Lakera’s 30-day Q4 2025 snapshot. The lesson is that once models read documents, browse sources, or call tools, security must shift from prompt filtering to workflow-level control.
At a glance
What this is: This analysis shows how Q4 2025 attack behavior shifted as early agentic systems began reading documents, browsing external sources, and invoking tools, with indirect prompt injection and prompt leakage becoming the most reliable attack paths.
Why it matters: It matters because identity and access controls for AI agents cannot stop at the prompt layer, and IAM teams need to treat external content, tool calls, and workflow boundaries as part of the security perimeter.
By the numbers:
- Only 44% of developers are reported to follow security best practices for secrets management, exposing a significant developer behaviour gap.
- When AWS credentials are exposed publicly, attackers attempt access within an average of 17 minutes and as quickly as 9 minutes in some cases.
👉 Read Lakera’s Q4 2025 analysis of agent attack patterns and prompt injection
Context
AI agents widen the attack surface because they do more than generate text. Once a model can read documents, browse untrusted content, call tools, or move through a workflow, attackers gain new ways to influence what the system sees and does. That is the real governance problem: security assumptions built for static prompts do not hold when the system can ingest, transform, and act on external inputs at runtime.
Lakera’s Q4 snapshot is useful because it reflects attacker behaviour around the first practical wave of agent features, not a mature end state. The patterns were familiar in one sense, but the target changed. Indirect instructions hidden in webpages or files, prompt-leakage attempts, and role-framing attacks show that agent security is already becoming an identity and access problem as much as a content-moderation problem.
Key questions
Q: How should security teams handle untrusted content in AI agent workflows?
A: Security teams should treat every retrieved page, file, message, or feed as untrusted until it is validated by provenance and policy checks. The control point is not the prompt alone. It is the retrieval, rendering, and tool-calling chain that turns external content into model context and then into action.
Q: Why do AI agents make prompt injection harder to contain?
A: AI agents make prompt injection harder to contain because they do more than answer questions. They can read documents, call tools, and take multi-step actions, which means a malicious instruction can enter through external content and influence behaviour after the user prompt has already been accepted.
Q: What do security teams get wrong about system prompt leakage?
A: They often treat it as only an information disclosure issue. In practice, leaked system prompts can reveal tool scope, policy boundaries, and workflow logic, which helps attackers shape the next interaction and increases the chance of unsafe tool use or data exposure.
Q: What should organisations do when agent safety checks are bypassed by role framing?
A: Organisations should move safety enforcement closer to the action path and not rely only on language-based checks. If analysis, simulation, or evaluation framing can change behaviour, then the governance model is too dependent on conversational intent and too weak at runtime authorisation.
Technical breakdown
Why indirect prompt injection outperforms direct attacks
Indirect prompt injection happens when malicious instructions arrive through content the agent is asked to process, such as a web page, document, or structured feed. That bypasses the obvious user prompt path and exploits the trust the system places in retrieved or rendered content. Because the instruction is embedded in legitimate-looking material, simple keyword filters and front-door safety checks often miss it. This is why agentic workflows expand the attack surface beyond chat input: every retrieved artefact can become an execution influence point.
Practical implication: treat external content as untrusted input and enforce inspection at retrieval, rendering, and tool-calling stages.
System prompt leakage as an execution blueprint
System prompts are not just configuration text. In agentic systems they often describe roles, tools, policy boundaries, and workflow logic, which means disclosure gives attackers a map of how the system reasons and when it will comply. Techniques such as hypothetical role framing and obfuscation work because they nudge the model into revealing internal instructions or constraints. In practice, prompt leakage becomes a governance issue when the system prompt reveals enough about tool permissions or routing logic to let an attacker shape the next action.
Practical implication: minimise exposed policy detail and separate instruction layers from content handling paths.
Why subtle safety bypasses work in agent workflows
The Q4 patterns show that models are easier to manipulate when the request is reframed as analysis, transformation, evaluation, or simulation. Those labels change the model’s interpretation of intent without changing the underlying harmful objective. In agent workflows, that matters because the model may be asked to summarise, compare, or normalise content before acting on it. The attacker is not defeating the safety policy directly. Instead, they are steering the agent into a different conversational frame where the policy boundary becomes less effective.
Practical implication: apply intent checks and approval gates to the action path, not just to user-facing language.
Threat narrative
Attacker objective: The attacker aims to coerce the agent into revealing internal instructions or performing actions that expose confidential data, bypass policy, or extend control into connected tools.
- Entry begins when the agent consumes an untrusted webpage, document, or structured prompt that contains hidden instructions or role-framing text.
- Escalation occurs when the model treats that content as legitimate context, exposing system instructions, policy boundaries, or tool descriptions that help shape the next step.
- Impact follows when the attacker uses the disclosed logic or induced behaviour to steer tool use, leak confidential data, or bypass safety controls across the workflow.
Breaches seen in the wild
- Moltbook AI agent keys breach — Moltbook breach exposed 1.5M AI agent keys.
- AI LLM hijack breach — attackers used stolen AWS access keys to hijack Anthropic LLM models on Bedrock.
Read our 52 NHI Breaches Analysis report for a comprehensive view of breaches impacting Non-Human Identities including AI Agents.
NHI Mgmt Group analysis
Indirect prompt injection is now an identity boundary problem, not just a content problem. Once an agent reads external content and can act on it, the question is no longer whether the text is harmful in isolation. The real issue is whether the system can distinguish trusted instructions from untrusted material at runtime. That makes tool invocation, retrieval, and rendering part of the identity control surface. Practitioners should treat external content as a delegated influence channel, not a passive input.
System prompt leakage exposes the operating model of the agent. The prompt often contains role definitions, tool scope, and workflow logic, which means disclosure gives attackers a map of the system’s decision structure. This is not just an implementation weakness. It shows that agent governance still assumes the attacker cannot see the control model. The implication is that systems built around hidden policy text may be easier to steer than teams expect.
Role-framing attacks succeed because current safety models still assume intent is stable across the session. Hypothetical scenarios, evaluation framing, and harmless transformation requests work by changing the conversational context without changing the attacker’s goal. That assumption was designed for human-style chat interactions. It fails when the same model is expected to reason, transform, and execute across multiple steps. The implication is that identity governance for agents must stop relying on a single request boundary as the unit of trust.
Workflow-level security is becoming the decisive control plane for agentic systems. The Q4 signal is not that agents are uniformly compromised, but that attack paths now run through documents, tools, and external sources as easily as through text prompts. That places execution-layer controls at the centre of governance. Practitioners should read this as a shift from content moderation to runtime authorisation, traceability, and provenance.
Agent security is converging with NHI governance faster than many programmes have planned for. A model that reads files, calls tools, and follows multi-step instructions behaves like a non-human identity with delegated access. The new failure mode is not just bad output. It is unauthorised influence over what the system is allowed to see and do. Teams should align AI security reviews with identity lifecycle, entitlement boundaries, and access review discipline.
From our research:
- The average estimated time to remediate a leaked secret is 27 days, despite 75% of organisations expressing strong confidence in their secrets management capabilities, according to The State of Secrets in AppSec.
- 43% of security professionals are concerned about AI systems learning and reproducing sensitive information patterns from codebases, according to The State of Secrets in AppSec.
- For the broader governance pattern, see The 52 NHI breaches Report, which shows how identity exposure turns into operational compromise once access is not tightly bounded.
What this signals
Indirect prompt injection is becoming the default escalation path for agentic systems. As more teams connect models to browsers, documents, and internal data, the attack surface shifts from the visible prompt to every trust boundary in the workflow. That is why agent security should be evaluated alongside OWASP Agentic AI Top 10 style threat modelling, not only content moderation.
Ephemeral content access creates persistent governance debt when provenance is not enforced. Even when a model only touches data briefly, the influence can persist through summaries, extracted instructions, or tool-triggered actions. With 27 days as the average time to remediate a leaked secret according to The State of Secrets in AppSec, operational delay becomes a material risk factor in any agent workflow that can ingest secrets or sensitive context.
Identity programmes should start treating agents as governed non-human identities with runtime constraints. The next step is not just better filtering. It is aligning agent access, delegation, and offboarding with lifecycle controls, provenance checks, and explicit tool boundaries so security teams can see what the agent was allowed to do before it does it.
For practitioners
- Classify every agent input as trusted or untrusted Apply explicit trust handling to retrieved web pages, documents, email, and other external sources before they reach model context. Do not let content provenance disappear once it enters the workflow.
- Constrain tool use to narrow, auditable permissions Limit each agent to the smallest possible tool set and log every call with input provenance, output destination, and downstream effect. Shorten the path between content ingestion and action execution.
- Separate instruction layers from content processing Keep policy instructions, system prompts, and task content distinct so attackers cannot exploit a single mixed context to change behaviour. Review where hidden instructions could be extracted or inferred.
- Add provenance checks to retrieval and rendering paths Validate where content came from before the model can summarise, transform, or act on it. Treat provenance failures as security events, not formatting issues.
- Review agent governance as an NHI programme extension Map agent access, delegation, and offboarding to the same lifecycle discipline used for other non-human identities, including entitlement review and emergency revocation.
Key takeaways
- Q4 2025 showed that attackers are already using untrusted content, role framing, and prompt leakage to steer early agentic systems.
- The scale of the problem is structural because agents expand the trust boundary from a prompt to the full workflow, including retrieval and tool use.
- Security teams should shift from content-only controls to provenance, runtime authorisation, and lifecycle governance for agent access.
Standards & Framework Alignment
This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.
OWASP Agentic AI Top 10 and OWASP Non-Human Identity Top 10 address the attack and risk surface, while NIST CSF 2.0 set the governance and control requirements practitioners need to meet.
| Framework | Control / Reference | Relevance |
|---|---|---|
| OWASP Agentic AI Top 10 | LLM01 | Directly addresses prompt injection and instruction abuse in agent workflows. |
| OWASP Non-Human Identity Top 10 | NHI-03 | Agent access and lifecycle handling mirror NHI credential governance. |
| NIST CSF 2.0 | PR.AC-4 | Workflow access and privilege boundaries are central to the article’s risk model. |
Map agent inputs and tool actions to LLM01 controls and inspect every external content path.
Key terms
- Indirect Prompt Injection: A malicious instruction hidden inside content the model is asked to read or process. The attack works because the harmful text arrives as apparently legitimate context, then influences the agent’s behaviour during retrieval, summarisation, or tool use.
- System Prompt Leakage: Exposure of internal instructions that define an agent’s role, policy boundaries, or tool scope. When those details are disclosed, attackers gain a blueprint for steering the system and identifying where its guardrails are most likely to fail.
- Runtime Authorisation: A decision made at execution time about whether an identity may take a specific action, use a tool, or reach a resource. For agents, this must account for changing context and should not rely only on static access granted at setup.
- Provenance Control: A governance mechanism that records where input came from and how trustworthy it is before the system acts on it. In agent workflows, provenance control helps separate safe internal instructions from untrusted external material.
What's in the full article
Lakera’s full article covers the operational detail this post intentionally leaves for the source:
- 30-day Q4 2025 attack sampling methodology across Lakera Guard-protected systems and the Gandalf: Agent Breaker environment
- Technique-by-technique breakdown of system-prompt extraction attempts, including hypothetical scenarios and obfuscation patterns
- Examples of indirect prompt injection payloads hidden inside webpages, files, and structured content
- Observed attacker adaptation patterns across browsing, retrieval, and lightweight tool-use scenarios
Deepen your knowledge
NHI governance, agentic AI identity, and machine identity security are core topics in our NHI Foundation Level course, the industry's only accredited NHI security programme. If you are building or maturing an identity security programme, it is worth exploring.
Published by the NHIMG editorial team on 2026-04-20.
NHI Mgmt Group — the independent authority on Non-Human Identity, IAM, and Agentic AI security. nhimg.org