TL;DR: AI agents that run for minutes or hours can lose state, repeat work, or fail mid-task unless their execution is durably recorded, according to WorkOS’s interview with Temporal co-founder Maxim Fateev. The governance question is not whether agents are clever, but whether their runtime assumptions survive crashes, retries, and intervention windows.
At a glance
What this is: This interview argues that durable execution is becoming a baseline requirement for AI agents because stateless workflows lose state when failures interrupt multi-step tasks.
Why it matters: IAM, NHI, and enterprise AI teams need auditability, continuity, and accountability when agent behaviour spans tools, approvals, and real data.
Context
Durable execution is the practice of preserving workflow state so a process can resume after a crash, restart, or infrastructure failure without redoing completed steps. In AI agent programmes, that matters because the agent may branch at runtime, wait on external services, and take actions over a long session, which makes ephemeral execution fragile.
For identity and governance teams, the issue is not just reliability. When an agent handles data, tools, and approvals across a session, the organisation needs a durable record of what it did, when it did it, and which state transitions were actually approved. That is the difference between an observable control plane and a black-box run.
The starting position here is typical for teams moving from demos to production: they build first, then discover that retries, state loss, and incomplete audit trails are governance problems as much as engineering problems.
Key questions
Q: How should security teams govern AI agents that run long, multi-step workflows?
A: Security teams should require durable execution, full event history, and clear ownership for every multi-step agent workflow that touches sensitive data or privileged tools. If the agent can lose state on failure, the organisation cannot reliably audit what happened or prove which actions were completed versus replayed.
Q: Why do AI agents complicate access governance more than ordinary automation?
A: AI agents complicate access governance because they can branch at runtime, wait on external services, and continue later with the same operational context. That means privilege is not just granted at launch; it persists across a live session that must be observable, resumable, and attributable.
Q: What breaks when AI workflows cannot survive crashes or restarts?
A: When AI workflows cannot survive crashes or restarts, completed steps are lost, retries become manual, and teams risk duplicate actions or incomplete records. The practical failure is not only uptime loss. It is the loss of trustworthy evidence needed for governance, audit, and incident review.
Q: How do teams know if durable execution is actually working for agents?
A: Teams know durable execution is working when a workflow can resume from a failure point without redoing completed work, and when the event history shows every decision and activity in order. If the only recovery option is to restart from scratch, the control is not real.
Technical breakdown
Why stateless AI agent workflows fail under failure recovery
Stateless agent frameworks assume a run can restart cleanly from the beginning. That works for short, deterministic tasks, but it breaks when an agent has already fetched data, made branching decisions, or written partial results. Without persisted workflow state, the system cannot know which steps are safe to replay and which would create duplicate actions or inconsistent outcomes. Durable execution changes the runtime model by storing the workflow history and rehydrating the process after interruption. The practical effect is that the agent becomes resumable rather than disposable.
Practical implication: if a workflow cannot be resumed from persisted state, treat it as non-production-grade for agent execution.
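The resume-from-history idea can be illustrated with a minimal sketch. It assumes a hypothetical file-backed `StepJournal` (not the API of Temporal or any other engine) that records each completed step's result and skips that step on a later run. The step names and the JSON file are illustrative only.

```python
import json
import os
import tempfile


class StepJournal:
    """Hypothetical journal of completed steps, persisted as a JSON file."""

    def __init__(self, path):
        self.path = path
        self.results = {}
        if os.path.exists(path):
            with open(path) as f:
                self.results = json.load(f)

    def run_step(self, step_id, fn):
        # A step already in the journal is not re-executed; its stored result is reused.
        if step_id in self.results:
            return self.results[step_id]
        result = fn()
        self.results[step_id] = result
        # Persist before moving on so a later crash cannot lose this step.
        # (A real engine would write atomically; this sketch does not.)
        with open(self.path, "w") as f:
            json.dump(self.results, f)
        return result


calls = []  # records real executions so the demo shows what actually ran


def fetch_data():
    calls.append("fetch")
    return {"rows": 3}


def write_report(data, crash):
    if crash:
        raise RuntimeError("simulated worker crash")
    calls.append("write")
    return f"report with {data['rows']} rows"


def agent_workflow(journal, crash=False):
    data = journal.run_step("fetch", fetch_data)
    return journal.run_step("write", lambda: write_report(data, crash))


journal_path = os.path.join(tempfile.mkdtemp(), "journal.json")

try:
    agent_workflow(StepJournal(journal_path), crash=True)  # first run dies mid-task
except RuntimeError:
    pass

# A restarted worker rebuilds the journal from disk and resumes.
final = agent_workflow(StepJournal(journal_path))
```

After the resume, `calls` is `["fetch", "write"]`: the fetch ran once even though the workflow function ran twice. A crash between a side effect and its journal write would still replay that step, so this sketch gives at-least-once behaviour rather than exactly-once.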
How durable execution separates workflow logic from infrastructure
Durable execution treats the workflow definition as application logic and the execution engine as the persistence layer. The workflow describes what should happen, while the orchestration layer records step completion, retries failed activities, and schedules the next action after recovery. This separation matters because the developer no longer needs to hand-roll checkpointing or retry state into every agent path. The architecture also supports event history, which turns execution into an auditable timeline rather than a transient process. For enterprise AI, that is the difference between building a tool and building a governed service.
Practical implication: place recovery, retries, and history outside the agent code path so the runtime, not the app team, owns continuity.
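One way to picture this separation is a sketch in which a hypothetical `Engine` class owns retries and history while the workflow function contains only business decisions. The class, the ticket activities, and the retry limit are assumptions made for illustration, not any vendor's interface.

```python
class Engine:
    """Hypothetical orchestration layer: owns retries and history, not business logic."""

    def __init__(self, max_attempts=3):
        self.max_attempts = max_attempts
        self.history = []  # ordered (event, activity, attempt) records

    def execute(self, name, activity, *args):
        last_error = None
        for attempt in range(1, self.max_attempts + 1):
            self.history.append(("started", name, attempt))
            try:
                result = activity(*args)
            except Exception as exc:
                self.history.append(("failed", name, attempt))
                last_error = exc
            else:
                self.history.append(("completed", name, attempt))
                return result
        raise RuntimeError(f"{name} failed after {self.max_attempts} attempts") from last_error


attempts = {"lookup": 0}


def lookup_ticket(ticket_id):
    # Flaky activity: fails on the first call, then succeeds.
    attempts["lookup"] += 1
    if attempts["lookup"] == 1:
        raise ConnectionError("transient failure")
    return {"id": ticket_id, "status": "open"}


def close_ticket(ticket):
    return {**ticket, "status": "closed"}


def ticket_workflow(engine, ticket_id):
    # Business logic only: no retry loops and no checkpoint code here.
    ticket = engine.execute("lookup_ticket", lookup_ticket, ticket_id)
    if ticket["status"] == "open":
        ticket = engine.execute("close_ticket", close_ticket, ticket)
    return ticket


engine = Engine()
outcome = ticket_workflow(engine, "T-1")
```

The transient failure never appears in `ticket_workflow`; it shows up only in `engine.history`, which is the single place a reviewer would look to see that `lookup_ticket` needed two attempts.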
Why runtime observability becomes part of agent identity governance
When an agent coordinates LLM calls, APIs, databases, and human approval steps, the governance challenge is not only whether the action succeeded. It is whether the organisation can reconstruct the sequence of decisions, inputs, and side effects after the fact. Durable execution creates a complete event trail, which is important for troubleshooting, audit, and accountability. In practice, this makes the workflow engine part of the identity control surface because it preserves the evidence needed to review how an agent used access over time. Without that trail, access governance degrades into guesswork.
Practical implication: require full event history for agent workflows that touch sensitive data or privileged tools.
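To make the reconstruction step concrete, the sketch below assumes a hypothetical `Event` record with a sequence number, the identity the agent ran as, an activity name, and an event kind. Real engines use their own schemas; the field names and the sample history here are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    seq: int       # position in the history
    agent: str     # identity the agent ran as
    activity: str  # tool or step name
    kind: str      # "started", "failed", or "completed"


def reconstruct(history):
    """Summarise an ordered event history per activity."""
    expected = list(range(1, len(history) + 1))
    if [e.seq for e in history] != expected:
        raise ValueError("history has gaps or is out of order")
    summary = {}
    for e in history:
        entry = summary.setdefault(
            e.activity, {"agent": e.agent, "attempts": 0, "status": "pending"}
        )
        if e.kind == "started":
            entry["attempts"] += 1
            entry["status"] = "in_progress"
        elif e.kind in ("failed", "completed"):
            entry["status"] = e.kind
    return summary


# Illustrative history: one retried read, then a send that never reported back.
history = [
    Event(1, "agent-billing", "read_crm", "started"),
    Event(2, "agent-billing", "read_crm", "failed"),
    Event(3, "agent-billing", "read_crm", "started"),
    Event(4, "agent-billing", "read_crm", "completed"),
    Event(5, "agent-billing", "send_email", "started"),
]

report = reconstruct(history)
```

Here `report["read_crm"]` shows two attempts ending in `completed`, while `send_email` remains `in_progress`. That second entry is the one a reviewer would need to investigate, because the history cannot confirm whether the side effect happened.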
Breaches seen in the wild
- Moltbook AI agent keys breach: 1.5M AI agent keys were exposed.
- AI LLM hijack breach: attackers used stolen AWS access keys to hijack Anthropic models on Bedrock.
Read our 52 NHI Breaches Analysis report for a comprehensive view of breaches impacting Non-Human Identities including AI Agents.
NHI Mgmt Group analysis
Durable execution is becoming an identity control surface for AI agents. Once an agent can run for extended periods, branch at runtime, and survive failures, the workflow engine is no longer just infrastructure. It becomes the place where state, retries, and evidence are preserved. That makes it relevant to identity governance because the organisation can only govern what it can later reconstruct, and reconstruction depends on durable event history.
Stateless execution was designed for short-lived processes, not agentic sessions. That assumption fails when the actor behaves autonomously at runtime, because decisions, tool calls, and waiting periods can span a long session with no stable checkpoint for review. The implication is that teams must rethink access review, accountability, and recovery models around resumable sessions rather than ephemeral runs.
Long-running agent workflows create a governance gap between action and evidence. When a workflow can fail midstream, the absence of persistence turns partial execution into lost context, which weakens auditability and incident reconstruction. The practical conclusion is that identity teams should treat event history and replayability as governance requirements, not optional engineering detail.
Durability exposes the hidden privilege problem inside agent orchestration. An agent that can call tools, wait on external responses, and continue later is effectively carrying privilege across time. That means the real control question is not only who initiated the run, but what state and access persisted through it. Practitioners should evaluate whether their orchestration layer is preserving or obscuring that privilege boundary.
Agent reliability and identity governance are converging around the same failure modes. In production, unreliable retries, lost state, and invisible execution paths are not just uptime issues. They also determine whether an enterprise can prove what an agent accessed, changed, or completed. The practical takeaway is to align workflow orchestration with access evidence from the start, before agent usage expands beyond controlled pilots.
From our research:
- 80% of organisations report their AI agents have already performed actions beyond their intended scope, including accessing unauthorised systems (39%), inappropriately sharing sensitive data (31%), and revealing access credentials (23%), according to the AI Agents: The New Attack Surface report.
- Only 44% of organisations have implemented any policies to govern AI agents, despite 92% agreeing that governing them is critical to enterprise security, according to SailPoint research.
- For a wider view of the control problem, read OWASP Agentic AI Top 10 and assess where workflow durability intersects with tool misuse and agent scope drift.
What this signals
Durable execution will become a baseline expectation for any AI programme that crosses the demo-to-production line. As agent use expands, teams will need to prove that execution history survives failures and that recovery does not erase evidence. The relevant benchmark is no longer whether the agent can complete a task once, but whether the programme can still explain the task after interruption, replay, or rollback.
Identity governance teams should treat replayability as an evidence requirement. If a workflow cannot be reconstructed after failure, then access review, incident triage, and privilege validation all become weaker. That is why durable event history should sit alongside least privilege in the operating model, especially for workflows that touch sensitive data or production systems.
The broader market signal is that agent orchestration and NHI governance are converging around the same operational control point. Teams that already use the NHI Lifecycle Management Guide for provisioning and offboarding should extend that discipline to agent sessions, because session continuity now determines how access evidence is created and preserved.
For practitioners
- Inventory agent workflows that cannot survive interruption: Map every AI agent flow that loses state on crash, restart, or worker reschedule. If the workflow has to restart from the beginning or depends on ad hoc checkpointing, classify it as a governance risk because you cannot reliably reconstruct the execution path.
- Require persisted event history for all privileged agent tasks: Make complete step history mandatory for any workflow that touches customer data, secrets, or production systems. Use the preserved history to support incident review, access validation, and post-execution accountability.
- Separate business logic from retry and recovery logic: Keep workflow decisions in the agent code and place retries, persistence, and scheduling in the orchestration layer. This reduces hand-built checkpoint logic and gives the security team one place to inspect execution continuity.
- Validate agent workflows for resumability before production rollout: Test whether an agent can resume after failure without reissuing completed actions or losing approval state. If it cannot, the workflow is not ready for high-risk access paths or regulated data handling.
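The resumability check in the last item can be sketched as a crash-injection test. Everything below is a hypothetical harness: a plain dictionary stands in for persisted state, the three step names are invented, and `stateless_flow` exists only as a contrast case that ignores that state.

```python
STEPS = ["request_approval", "debit_account", "notify_owner"]


class CrashInjected(Exception):
    pass


def durable_flow(journal, effects, crash_before=None):
    """Hypothetical agent flow; `journal` stands in for persisted state."""
    for step in STEPS:
        if step in journal:
            continue  # completed earlier, so do not reissue it
        if step == crash_before:
            raise CrashInjected(step)
        effects.append(step)  # the real side effect would happen here
        journal[step] = "approved" if step == "request_approval" else "done"
    return journal


def stateless_flow(journal, effects, crash_before=None):
    # Contrast case: ignores persisted state, so every resume starts over.
    return durable_flow({}, effects, crash_before)


def check_resumable(workflow):
    """Crash before each step, resume, and report the crash points that break."""
    failures = []
    for crash_point in STEPS:
        journal, effects = {}, []
        try:
            workflow(journal, effects, crash_before=crash_point)
        except CrashInjected:
            pass
        workflow(journal, effects)  # resume with whatever state survived
        ran_once = effects == STEPS  # each side effect exactly once, in order
        approval_kept = journal.get("request_approval") == "approved"
        if not (ran_once and approval_kept):
            failures.append(crash_point)
    return failures


durable_failures = check_resumable(durable_flow)
stateless_failures = check_resumable(stateless_flow)
```

Under these assumptions `durable_failures` is empty, while `stateless_failures` lists every crash point, because the stateless flow either repeats a side effect or loses the approval record. A production test would inject failures into the real worker process rather than a function call.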
Key takeaways
- AI agents that lose state on failure create a governance problem as much as an engineering problem, because no one can reliably reconstruct what happened.
- Durable execution turns workflow history into evidence, which is essential when agents call tools, wait on approvals, and touch sensitive data.
- If an agent cannot resume cleanly after interruption, the workflow is not ready for high-risk enterprise access paths.
Standards & Framework Alignment
This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.
OWASP Agentic AI Top 10 and OWASP Non-Human Identity Top 10 address the attack and risk surface, while NIST CSF 2.0 sets the governance and control requirements practitioners need to meet.
| Framework | Control / Reference | Relevance |
|---|---|---|
| OWASP Agentic AI Top 10 | A-03 | Agent workflow reliability and tool misuse sit inside runtime agent governance. |
| OWASP Non-Human Identity Top 10 | NHI-06 | Agent sessions depend on persistent identity evidence and recoverable execution state. |
| NIST CSF 2.0 | PR.AA-05 | Durable access trails support privileged access review and accountability. |
Track agent identity activity end-to-end and preserve execution history for auditability.
Key terms
- Durable Execution: Durable execution is a workflow design pattern that preserves state so a process can resume after failure without redoing completed work. In AI agent systems, it turns a fragile run into a recoverable sequence with replay, history, and auditability.
- Workflow History: Workflow history is the ordered record of steps, decisions, retries, and signals that occurred during execution. For agent governance, it is the evidence trail that shows what happened, what was retried, and where a workflow resumed after interruption.
- Resumable Session: A resumable session is an execution context that can continue after a crash, restart, or worker change without losing the state needed to finish. In agentic systems, resumability is a governance requirement because it preserves continuity, attribution, and control boundaries.
- Event Orchestration Layer: The event orchestration layer is the infrastructure component that schedules, persists, and coordinates workflow steps. It matters in identity governance because it can hold the authoritative record of agent actions, retries, and completion states across failures.
Deepen your knowledge
Durable execution for AI agents is covered in our NHI Foundation Level course, the industry's only accredited NHI security programme. If you are building agent workflows that must survive failure and still remain governable, this is a relevant starting point.
This post draws on content published by WorkOS: Maxim Fateev on why durable execution matters for AI agents. Read the original.
Published by the NHIMG editorial team on 2026-04-15.
NHI Mgmt Group — the independent authority on Non-Human Identity, IAM, and Agentic AI security. nhimg.org