Cloudflare rotation outage: what safer NHI rotations must fix

NHI Mgmt Group

(@nhi-mgmt-group)

Topic starter 06/06/2026 1:55 am

TL;DR: Cloudflare’s March 21, 2025 outage lasted 1 hour and 7 minutes and caused total write failures and partial read failures in R2. According to Oasis Security, it stemmed from a credential rotation error: the new credential was deployed to the wrong environment, and the old keys were then deleted while production still depended on them. The lesson is structural: rotation without verification turns NHI lifecycle control into an outage trigger, not a safeguard.

NHIMG editorial — based on content published by Oasis Security: Don’t Look Back In Anger: How Cloudflare’s Outage Highlights the Need for Safer Rotations

By the numbers:

Cloudflare’s March 21, 2025 outage lasted 1 hour and 7 minutes.

Questions worth separating out

Q: What breaks when a credential is rotated without production verification?

A: Production can keep using the old credential while the backend has already revoked it, which turns a security task into an outage.

Q: Why do service account and secret rotations cause outages in multi-cloud environments?

A: Multi-cloud environments create more consumers, more deployment paths, and more chances to update the wrong target.

Q: How do security teams know if secret rotation is actually working?

A: Rotation is working when the new credential is active in production, the old one is no longer in use, and the service remains stable after cutover.
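The three conditions in that answer can be written as a single check. This is a minimal sketch, not a description of any vendor's tooling: the `RotationObservation` fields, the `rotation_is_working` function, and the 1% error threshold are all hypothetical stand-ins for whatever runtime telemetry a team actually collects.

```python
from dataclasses import dataclass


@dataclass
class RotationObservation:
    """Runtime evidence gathered after cutover (all fields are hypothetical)."""
    new_key_requests: int  # production requests authenticated with the new credential
    old_key_requests: int  # production requests still using the old credential
    error_rate: float      # fraction of failed requests in the observation window


def rotation_is_working(obs: RotationObservation, max_error_rate: float = 0.01) -> bool:
    """Return True only when all three conditions hold at the same time."""
    new_active = obs.new_key_requests > 0        # new credential is live in production
    old_unused = obs.old_key_requests == 0       # old credential is no longer in use
    stable = obs.error_rate <= max_error_rate    # service is stable after cutover
    return new_active and old_unused and stable


# Healthy cutover: traffic has moved and errors are low.
healthy = rotation_is_working(RotationObservation(120, 0, 0.001))
# Failed cutover: production never picked up the new credential.
stale = rotation_is_working(RotationObservation(0, 50, 0.0))
```

The point of the sketch is that every input is an observation from production, not the status of a change ticket.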

Practitioner guidance

Add consumption verification before revocation: Require an explicit check that production traffic has moved to the new credential before deleting the old one.
Map every secret to its live consumer and environment: Maintain an inventory that links each key, token, or certificate to the exact service, environment, and owner.
Use phased rotation with rollback-safe overlap: Keep the old credential valid until the new one is proven in production, then retire the old path in a controlled sequence.
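The guidance above can be sketched as a rotation routine that refuses to revoke until production is observed on the new credential. Everything here is an illustrative assumption: `FakeBackend` is an in-memory stand-in for a real credential store, and the `environments` dictionary stands in for observing which key each environment actually uses at runtime.

```python
class FakeBackend:
    """In-memory stand-in for a credential store (hypothetical)."""

    def __init__(self, active_keys):
        self.active_keys = set(active_keys)

    def create_key(self, key_id):
        self.active_keys.add(key_id)

    def revoke_key(self, key_id):
        self.active_keys.discard(key_id)


def rotate(backend, environments, target_env, old_key, new_key):
    """Phased rotation: create, deploy, verify in production, then revoke.

    `environments` maps an environment name to the key its service uses.
    """
    backend.create_key(new_key)         # overlap window: both keys are valid
    environments[target_env] = new_key  # deploy to the chosen target

    # Consumption verification: check what production actually uses,
    # rather than assuming the deployment reached it.
    if environments["production"] != new_key:
        return "aborted: production still uses the old key, old key kept"

    backend.revoke_key(old_key)         # retire the old path only after proof
    return "rotated"


# A deployment aimed at the wrong environment is caught before revocation.
backend = FakeBackend({"old-key"})
envs = {"dev": "old-key", "production": "old-key"}
wrong_target = rotate(backend, envs, "dev", "old-key", "new-key")
```

In the wrong-target case the old key stays valid, so production keeps working while the mistake is corrected. That is the rollback-safe overlap the third item describes.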

What's in the full article

Oasis Security's full blog post covers the operational detail this post intentionally leaves for the source:

A step-by-step account of the Cloudflare rotation sequence, including the environment mismatch that led to the outage.
Practical rotation patterns for reducing downtime when credentials must be replaced across production dependencies.
Examples of how to detect stale secrets, orphaned identities, and missing owners before revocation.
Oasis Security's description of its automated discovery and policy-driven rotation workflow for NHI estates.

👉 Read Oasis Security's analysis of Cloudflare’s rotation outage and safer NHI rotations →

Mr NHI

(@mr-nhi)

06/06/2026 3:17 am

Rotation without verification is a governance failure, not a process detail. The Cloudflare outage happened because the organisation assumed that a credential change in one environment meant production had moved too. That assumption fails whenever lifecycle state is inferred instead of observed. The implication is that NHI governance must treat credential state as runtime evidence, not as a change ticket outcome.

A few things that frame the scale:

Only 19.6% of security professionals express strong confidence in their organisation's ability to securely manage non-human workload identities, according to The 2024 Non-Human Identity Security Report.
Another 35.6% of organisations cite managing consistent access across hybrid and multi-cloud environments as their top NHI security challenge, which matches the kind of complexity that made verification fail here.

A question worth separating out:

Q: Who is accountable when a failed rotation takes down production systems?

A: Accountability sits with the team that owns the credential lifecycle and the service that depends on it. Governance frameworks expect ownership, change control, and verification to be defined before rotation begins. If no one can prove the active consumer, the control design is incomplete.
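One way to make that expectation concrete is a pre-rotation gate that blocks the change when the inventory cannot name the consumer, environment, or owner. This is only a sketch under stated assumptions: the `SecretRecord` shape, the `rotation_blockers` function, and the example identifiers are hypothetical and not taken from any governance framework or product.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SecretRecord:
    """One inventory entry for a key, token, or certificate (hypothetical shape)."""
    secret_id: str
    service: Optional[str]      # the live consumer of the secret
    environment: Optional[str]  # where that consumer runs
    owner: Optional[str]        # team accountable for the lifecycle


def rotation_blockers(record: SecretRecord) -> List[str]:
    """List the inventory fields that are missing; an empty list means go ahead."""
    required = ("service", "environment", "owner")
    return [name for name in required if not getattr(record, name)]


# A record with no owner should stop the rotation before it starts.
unowned = SecretRecord("example-storage-key", "storage-gateway", "production", None)
blockers = rotation_blockers(unowned)
```

A non-empty result means the control design is incomplete in exactly the sense described above, so the rotation should wait until the gap is filled.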

👉 Read our full editorial: Cloudflare’s rotation outage shows why safer NHI rotations matter
