Cloudflare’s rotation outage shows why safer NHI rotations matter

By NHI Mgmt Group Editorial Team | Published 2026-05-01 | Domain: Workload Identity | Source: Oasis Security

TL;DR: Cloudflare’s March 21, 2025 outage lasted 1 hour and 7 minutes and caused total write failures and partial read failures in R2. According to Oasis Security, it stemmed from a credential rotation error: the new credentials were deployed to the wrong environment, so production was still using the old keys when they were deleted. The lesson is structural: rotation without verification turns NHI lifecycle control into an outage trigger, not a safeguard.

At a glance

What this is: This analysis shows how a credential rotation mistake at Cloudflare turned NHI lifecycle management into a global service outage.

Why it matters: IAM teams cannot treat rotation as a clerical task when production dependencies, environment targeting, and verification controls decide whether access continuity survives.

By the numbers:

Cloudflare’s March 21, 2025 outage lasted 1 hour and 7 minutes.

👉 Read Oasis Security's analysis of Cloudflare’s rotation outage and safer NHI rotations

Context

Credential rotation is meant to reduce exposure, but it only works when teams know which identity is live, which environment it belongs to, and when it is safe to retire the old secret. The Cloudflare outage shows that NHI lifecycle management fails when deletion happens before production verification, especially in multi-environment delivery flows.

For IAM and PAM teams, this is not just an operational hygiene issue. It is a governance problem about ownership, dependency mapping, and confirmation that the intended credential is actually in use before the old one is removed. The problem is common in modern cloud environments, where deployment paths and credential consumers are often more tangled than teams assume.

Key questions

Q: What breaks when a credential is rotated without production verification?

A: Production can keep presenting the old credential after the backend has revoked it, which turns a security task into an outage. The failure mode is not the rotation itself, but the assumption that a change in one environment automatically applies everywhere. Teams should verify live consumption before decommissioning any secret.

Q: Why do service account and secret rotations cause outages in multi-cloud environments?

A: Multi-cloud environments create more consumers, more deployment paths, and more chances to update the wrong target. If ownership and dependency mapping are incomplete, teams may revoke a secret before the live workload has moved. The result is service interruption, not improved security.

Q: How do security teams know if secret rotation is actually working?

A: Rotation is working when the new credential is active in production, the old one is no longer in use, and the service remains stable after cutover. The strongest signal is not that a change was approved, but that runtime evidence confirms the consumer has switched.

Q: Who is accountable when a failed rotation takes down production systems?

A: Accountability sits with the team that owns the credential lifecycle and the service that depends on it. Governance frameworks expect ownership, change control, and verification to be defined before rotation begins. If no one can prove the active consumer, the control design is incomplete.

Technical breakdown

Credential rotation in multi-environment pipelines

Credential rotation is a staged identity change, not a simple replacement. In Cloudflare’s case, the new key pair was generated, but the updated credential was sent to the default environment because the production flag was omitted. That meant production kept using the old secret while the backend had already been prepared to decommission it. The technical failure is a mismatch between deployment intent and actual runtime dependency. In NHI terms, the rotation succeeded procedurally but failed operationally because the consuming system never migrated. This is why environment targeting, dependency mapping, and consumption checks matter as much as the rotation event itself.

Practical implication: verify the active production consumer before any credential decommissioning.
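One way to guard against an omitted environment flag is to make the target mandatory, so a deployment can never fall back to a default. The following is a minimal sketch under stated assumptions, not Cloudflare's tooling: the environment names, the credential identifier format, and the function plan_credential_deployment are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical environment names; a real pipeline would load these from config.
KNOWN_ENVIRONMENTS = {"staging", "production"}


class MissingEnvironmentError(ValueError):
    """Raised when a credential deployment has no explicit, known target."""


@dataclass(frozen=True)
class DeploymentPlan:
    credential_id: str
    environment: str


def plan_credential_deployment(
    credential_id: str, environment: Optional[str] = None
) -> DeploymentPlan:
    """Build a deployment plan, refusing to fall back to a default environment."""
    if environment is None:
        # An omitted target is treated as an error, never as "use the default".
        raise MissingEnvironmentError(
            f"no environment given for {credential_id!r}; refusing to use a default"
        )
    if environment not in KNOWN_ENVIRONMENTS:
        raise MissingEnvironmentError(f"unknown environment {environment!r}")
    return DeploymentPlan(credential_id=credential_id, environment=environment)
```

With this shape, calling plan_credential_deployment("example-key-v2") raises an error instead of silently choosing a target, while passing environment="production" returns a plan that later steps can log and verify against.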

Why rotation verification is the missing control

Verification is the control that confirms the new credential is both deployed and in use, and that the old one is no longer serving live traffic. Cloudflare’s outage happened because the process assumed migration had occurred when it had not. Without runtime verification, rotation becomes a blind trust exercise, especially in systems with multiple gateways, services, and credentials. The deeper problem is that lifecycle actions are being executed on assumptions rather than observed state. For NHI governance, that creates a dangerous gap between change approval and real service behaviour.

Practical implication: require consumption-level verification before revoking any old key or token.
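A consumption-level check can be expressed as a gate over authentication logs: revocation is allowed only when the target environment has recently used the new key and has not used the old one. This is an illustrative sketch; the AuthEvent record, the safe_to_revoke function, and the one-hour observation window are assumptions, and a real check would read from whatever audit log the backend actually exposes.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable


@dataclass(frozen=True)
class AuthEvent:
    """One authentication attempt, as it might appear in an audit log (hypothetical shape)."""
    key_id: str
    environment: str
    timestamp: datetime


def safe_to_revoke(
    events: Iterable[AuthEvent],
    old_key_id: str,
    new_key_id: str,
    environment: str,
    now: datetime,
    window: timedelta = timedelta(hours=1),
) -> bool:
    """Return True only if, inside the window, the environment used the new key and never the old one."""
    recent = [
        e for e in events
        if e.environment == environment and now - window <= e.timestamp <= now
    ]
    used_old = any(e.key_id == old_key_id for e in recent)
    used_new = any(e.key_id == new_key_id for e in recent)
    # Both conditions are required: absence of old-key traffic alone could
    # simply mean the service is idle, not that it has migrated.
    return used_new and not used_old
```

Note that new-key activity in a different environment does not satisfy the gate. If only staging has switched, the function returns False for production, which is the situation the outage description points to.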

Dependency visibility and rollback-safe rotations

A safe rotation process needs a clear map of which credential authenticates which service, what downstream systems depend on it, and how to revert if validation fails. The outage description shows that Cloudflare’s R2 Gateway depended on multiple underlying services and credentials, but those dependencies were not visible enough to prevent a bad cutover. Rolling rotations reduce this risk by preserving the old path until the new path is proven. In governance terms, the issue is not just secret hygiene. It is whether the rotation workflow is reversible, observable, and tied to service ownership.

Practical implication: maintain dependency maps and phased rollback paths for every high-value secret.
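The rolling pattern described above can be sketched as a loop over the consumers in a dependency map: deploy the new credential to each one, verify it, and revoke the old credential only after every consumer has passed. The function rotate_with_overlap and its callback parameters are hypothetical placeholders for whatever deployment, verification, and revocation steps an organisation actually runs.

```python
from typing import Callable, List, Sequence


def rotate_with_overlap(
    consumers: Sequence[str],
    deploy_new: Callable[[str], None],
    verify_new: Callable[[str], bool],
    rollback: Callable[[str], None],
    revoke_old: Callable[[], None],
) -> bool:
    """Migrate consumers one at a time; revoke the old credential only if all verify.

    The old credential stays valid throughout, so a failed verification can be
    rolled back without an outage.
    """
    migrated: List[str] = []
    for consumer in consumers:
        deploy_new(consumer)
        migrated.append(consumer)
        if not verify_new(consumer):
            # Undo in reverse order so every touched consumer returns to the old path.
            for done in reversed(migrated):
                rollback(done)
            return False
    # Reached only when every consumer has been observed on the new credential.
    revoke_old()
    return True
```

The design choice here is that revocation is the last step and is unreachable on any failure path, so a mistaken deployment costs a rollback rather than an interruption.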

Threat narrative

Attacker objective: The objective was not malicious compromise but inadvertent service interruption caused by mismanaged credential lifecycle execution.

Entry occurred through an internal credential rotation workflow that targeted the wrong environment, leaving production on an outdated credential set.
Credential access became service disruption when the old key was deleted before production had migrated to the new one, causing authentication mismatch.
Impact followed immediately as Cloudflare R2 experienced total write failures and degraded read operations worldwide.

Cisco DevHub NHI breach — IntelBroker exploited exposed Cisco credentials, API tokens and keys in DevHub.
230M AWS environment compromise — 230M AWS environments compromised via exposed .env files with cloud credentials.

Read our 52 NHI Breaches Analysis report for a comprehensive view of breaches impacting Non-Human Identities including AI Agents.

NHI Mgmt Group analysis

Rotation without verification is a governance failure, not a process detail. The Cloudflare outage happened because the organisation assumed that a credential change in one environment meant production had moved too. That assumption fails whenever lifecycle state is inferred instead of observed. The implication is that NHI governance must treat credential state as runtime evidence, not as a change ticket outcome.

Credential ownership and dependency mapping are now core control boundaries. A rotation workflow cannot safely revoke secrets when nobody can prove which service is still using them. This is exactly where NHI lifecycle governance meets service topology. When ownership, consumer context, and dependency paths are unclear, the rotation task becomes a service outage mechanism rather than a security safeguard.

Safe secret rotation is really blast-radius control. The practical question is not whether a secret can be rotated on schedule, but whether the organisation can constrain failure when the new path does not fully take over. That aligns with OWASP-NHI and NIST CSF thinking, but the field should be sharper about the named failure mode: verification gap rotation. Practitioners should treat this as an architecture problem, not a calendar problem.

Multi-cloud and multi-service environments expose the limits of manual lifecycle operations. The more deployment paths and consumers a secret has, the less credible a manual rotate-and-delete model becomes. This is why lifecycle governance needs continuous visibility into active use, not just inventory. Teams should rethink whether their current process can actually survive the dependency complexity they already operate.

Orphaned and misunderstood credentials create trust debt across the identity estate. Once a team cannot confidently answer where a secret is used, every future rotation becomes riskier. That debt accumulates until one routine change takes down a critical workload. The practitioner conclusion is simple: if you cannot prove usage before revocation, you do not yet control the lifecycle.

From our research:
Only 19.6% of security professionals express strong confidence in their organisation's ability to securely manage non-human workload identities, according to The 2024 Non-Human Identity Security Report.
Another 35.6% of organisations cite managing consistent access across hybrid and multi-cloud environments as their top NHI security challenge, which matches the kind of complexity that made verification fail here.
NHI Lifecycle Management Guide shows why lifecycle ownership and revocation checks must be tied to real consumer state, not assumed migration.

What this signals

Verification gap rotation: the industry keeps treating secret rotation as a calendar event, when the real control is proof of live consumer transition. As multi-cloud estates expand, teams need to watch for environments where secret state, service ownership, and deployment state drift apart.

With 35.6% of organisations already naming consistent access across hybrid and multi-cloud environments as their top challenge, per The 2024 Non-Human Identity Security Report, the operational signal is clear: lifecycle process maturity is lagging behind infrastructure complexity.

The immediate programme implication is to tie rotation workflows to dependency maps, runtime logs, and rollback-ready overlap windows. That shift turns secret renewal from a trust exercise into an observed control, which is the only way to reduce outage risk at scale.

For practitioners

Add consumption verification before revocation: Require an explicit check that production traffic has moved to the new credential before deleting the old one. Use runtime logs, consumer identity mapping, and service-level confirmation rather than change approval alone.
Map every secret to its live consumer and environment: Maintain an inventory that links each key, token, or certificate to the exact service, environment, and owner. If a secret has no mapped owner or consumer, treat it as a governance defect before the next rotation.
Use phased rotation with rollback-safe overlap: Keep the old credential valid until the new one is proven in production, then retire the old path in a controlled sequence. This reduces the chance that a mistaken deployment takes down the service.
Automate detection of stale or orphaned secrets: Continuously scan for secrets that remain active after their intended migration window, especially across cloud, vault, and CI/CD boundaries. Link that detection to ownership remediation and escalation paths.
Review dependency chains before high-impact rotations: For services with multiple upstream and downstream dependencies, document which component authenticates where and what fails if the credential changes. Use that map to decide whether a rotation is safe to execute now.
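The inventory and stale-secret recommendations can be combined into a simple scan. The sketch below assumes a hypothetical SecretRecord with owner, consumer, environment, and a retire_by date marking the end of the planned migration window. It flags records that lack any of the three mappings as orphaned and records still active past their window as stale. Field names and labels are illustrative only.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class SecretRecord:
    """One inventory entry linking a secret to its owner, consumer, and environment."""
    secret_id: str
    owner: Optional[str]
    consumer: Optional[str]
    environment: Optional[str]
    retire_by: Optional[date] = None  # end of the planned migration window, if any
    active: bool = True


def find_governance_defects(
    records: Iterable[SecretRecord], today: date
) -> Dict[str, List[str]]:
    """Return a mapping of secret_id to the defects found for that secret."""
    findings: Dict[str, List[str]] = {}
    for record in records:
        issues: List[str] = []
        # A secret with no mapped owner, consumer, or environment cannot be rotated safely.
        if record.owner is None or record.consumer is None or record.environment is None:
            issues.append("orphaned")
        # A secret still active after its migration window has outlived its plan.
        if record.active and record.retire_by is not None and today > record.retire_by:
            issues.append("stale")
        if issues:
            findings[record.secret_id] = issues
    return findings
```

Findings from a scan like this would feed the ownership remediation and escalation paths described above, rather than triggering revocation directly.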

Key takeaways

The outage showed that rotation can fail even when teams believe they followed the process, because environment targeting and verification were missing.
The impact was measurable and immediate, with a 1 hour and 7 minute service disruption and global write failures in R2.
The control that would have limited the blast radius is production verification before revocation, backed by ownership and dependency mapping.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

The OWASP Non-Human Identity Top 10 addresses the attack and risk surface, while NIST CSF 2.0 and NIST Zero Trust (SP 800-207) set the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
OWASP Non-Human Identity Top 10	NHI-03	Rotation failure and revocation timing map directly to secret lifecycle controls.
NIST CSF 2.0	PR.AA-01 (PR.AC-1 in CSF 1.1)	Access lifecycle management and ownership are central to safe credential retirement.
NIST Zero Trust (SP 800-207)	Section 2.1, tenet 6	Zero trust requires continuous validation of identity and access continuity during rotation.

Verify secret consumption before revocation and enforce phased rotation with rollback overlap.

Key terms

Credential Rotation: Credential rotation is the controlled replacement of a secret, token, key, or certificate before it becomes risky or invalid. In NHI governance, the hard part is not generating the new credential but proving that the live workload has switched before the old one is removed.
Production Verification: Production verification is the check that confirms a new credential is actually being used by the live service in the intended environment. It matters because lifecycle actions based only on approval or deployment status can revoke access that production still depends on.
Dependency Mapping: Dependency mapping is the process of identifying which services, environments, and downstream systems rely on a given identity or secret. For NHIs, it is essential because a hidden consumer can turn a routine rotation into an outage or access failure.
Orphaned Secret: An orphaned secret is a credential that remains active without a clear owner, consumer, or business purpose. In NHI programmes, orphaning often appears after service changes, and it raises both security risk and operational uncertainty during rotation or revocation.

Deepen your knowledge

Credential rotation and lifecycle verification are core topics in our NHI Foundation Level course, the industry's only accredited NHI security programme. If your team is dealing with multi-cloud secrets and outage risk, the course is worth exploring.

This post draws on content published by Oasis Security: Don’t Look Back In Anger: How Cloudflare’s Outage Highlights the Need for Safer Rotations. Read the original.

NHIMG Editorial Note
Published by the NHIMG editorial team on 2026-05-01.
NHI Mgmt Group — the independent authority on Non-Human Identity, IAM, and Agentic AI security. nhimg.org