Cloudflare outage exposes the operational risk of standing privileges

By NHI Mgmt Group Editorial Team | Published 2025-11-20 | Domain: Governance & Risk | Source: Apono

TL;DR: Cloudflare’s outage was triggered by a permissions change that expanded an internal feature file until it broke production systems, showing how standing privileges can create operational blast radius even without an attacker, according to Apono. The incident reinforces that access scope and permanence are reliability controls as much as security controls.

At a glance

What this is: Apono argues that Cloudflare’s outage shows how standing privileges can turn a routine permissions change into a global service failure.

Why it matters: For IAM, PAM, and NHI teams, it is a reminder that access scope and permanence affect operational resilience across human, service, and automation identities.


Context

Cloudflare’s outage is a privilege governance problem as much as an availability event. A permissions change in an internal database system caused a feature file to grow beyond its limit, and the failure propagated across production. The primary lesson for identity programmes is that standing access can create outage risk even when no attacker is present.

This is a classic Zero Standing Privileges question in a cloud environment: who can change sensitive systems, how broadly that change can flow, and whether permanent access is necessary for the task. The incident also maps directly to NHI governance because service accounts, automation pipelines, and API keys often carry the same operational reach as human operators with far less scrutiny.

Key questions

Q: What breaks when standing privileges are left in place for cloud infrastructure changes?

A: Standing privileges increase the chance that a routine change can affect shared systems far beyond the intended task. In cloud environments, that can turn one valid administrative action into a production outage, a security incident, or both. The problem is not just misuse by attackers. Persistent access expands the blast radius of legitimate work.

Q: Why do standing privileges increase operational risk in infrastructure teams?

A: Because they let the same identity retain broad rights across changing contexts, including when the task no longer needs them. That persistence creates exposure if the identity is misused, misconfigured, or simply used correctly in a way that has wider impact than expected. Temporary access and narrow scope reduce that risk.

Q: How do teams know if privilege governance is actually reducing outage risk?

A: Look for fewer always-on write permissions on shared production systems, tighter scope on high-risk identities, and more changes executed through temporary elevation with expiry. If privileged actions still rely on broad permanent access, the governance model may be improving visibility, but it is not reducing the underlying failure surface.

Q: Who is accountable when a permissions change causes a service outage?

A: Accountability usually sits with the teams that own access design, operational change control, and system resilience, not with the identity team alone. If a permissions change can cascade into a widespread failure, the governance model has to be reviewed jointly by IAM, PAM, and infrastructure owners. Zero standing access is one control, but deciding who owns the change path is the accountability question.

Technical breakdown

How a permissions change became a production outage

The outage began with an internal permissions change, not an intrusion. That change altered the output of a recurring database query that generates a Bot Management feature file, and the query began writing unintended entries into the file. Once the file exceeded a built-in size limit, systems that loaded it failed immediately. The critical mechanism is not file corruption in isolation but the chain reaction that follows when one privileged configuration change can influence a production artifact consumed globally.

Practical implication: Map which identities can modify shared control-plane inputs and require tighter approval and scope controls around those changes.
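One way to contain this kind of failure chain is to validate a generated artifact before it reaches consumers and to keep serving the last known good version if the check fails. The following is a minimal Python sketch under stated assumptions, not a description of Cloudflare's implementation: the names MAX_FEATURES, validate_feature_rows, and publish_or_keep_last_good, and the limit value itself, are hypothetical.

```python
from typing import List, Optional, Tuple

# Hypothetical limit for illustration only; the source does not state the real value.
MAX_FEATURES = 200


def validate_feature_rows(rows: List[str], max_features: int = MAX_FEATURES) -> Optional[str]:
    """Return a rejection reason, or None if the rows look safe to publish."""
    if len(rows) != len(set(rows)):
        return "duplicate entries"
    if len(rows) > max_features:
        return "too many entries"
    return None


def publish_or_keep_last_good(
    new_rows: List[str], last_good: List[str]
) -> Tuple[List[str], Optional[str]]:
    """Publish new_rows only if they validate; otherwise keep last_good."""
    reason = validate_feature_rows(new_rows)
    if reason is not None:
        # Fail closed on the new artifact, not on the consumers that load it.
        return last_good, reason
    return list(new_rows), None


if __name__ == "__main__":
    active, reason = publish_or_keep_last_good(["f1", "f1", "f2"], ["f1", "f2"])
    print(active, reason)
```

The design choice in this sketch is that a bad generation step is rejected at publish time, so an upstream permissions change cannot silently alter what production systems load.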

Standing privileges and blast radius in cloud operations

Standing privileges are persistent rights that remain available even when they are not actively needed. In cloud and infrastructure workflows, they often sit behind database changes, proxy configuration, feature generation, and automation pipelines. That makes them dangerous not only because an attacker could abuse them, but because a legitimate operator can unintentionally trigger wide impact. The technical failure mode is excessive blast radius, where one allowed action cascades into a global service outage.

Practical implication: Treat persistent write access to shared infrastructure as a blast-radius issue and constrain it to the smallest possible operational boundary.
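To make the blast-radius framing concrete, persistent write grants can be ranked by how many downstream consumers depend on the resources they touch. This is a rough Python sketch, assuming a simple inventory of grants and a per-resource consumer count; the Grant fields, the rank_by_blast_radius function, and the example identities and counts are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class Grant:
    identity: str    # human, service account, or pipeline (hypothetical names)
    resource: str    # shared resource the grant applies to
    can_write: bool
    standing: bool   # True if the access never expires


def rank_by_blast_radius(
    grants: List[Grant], consumers: Dict[str, int]
) -> List[Tuple[str, int]]:
    """Score each identity by the consumers of resources it can always write to."""
    writable: Dict[str, Set[str]] = {}
    for g in grants:
        if g.can_write and g.standing:
            writable.setdefault(g.identity, set()).add(g.resource)
    scores = {
        identity: sum(consumers.get(r, 0) for r in resources)
        for identity, resources in writable.items()
    }
    # Highest score first: these identities deserve the tightest scope review.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    grants = [
        Grant("svc-feature-gen", "feature_db", can_write=True, standing=True),
        Grant("alice", "feature_db", can_write=True, standing=False),
        Grant("svc-report", "report_db", can_write=True, standing=True),
    ]
    print(rank_by_blast_radius(grants, {"feature_db": 300, "report_db": 2}))
```

Temporary grants are deliberately excluded from the score, since the point of the ranking is to surface reach that persists when no task is in progress.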

Zero Standing Privileges as an operational control

Zero Standing Privileges removes always-on access and replaces it with Just-in-Time elevation, narrow scope, and automatic expiry. In practice, that changes the failure model from permanent reach to temporary, auditable elevation. For infrastructure identities, the control matters because it forces each sensitive action to be explicit, time-bound, and reviewable. The Cloudflare case shows why this is not just a breach-prevention pattern. It is also a resilience pattern for preventing accidental propagation of a bad change.

Practical implication: Use JIT elevation for sensitive infrastructure changes so the identity cannot retain broad rights after the task is complete.
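The three properties named above (explicit, time-bound, reviewable) can be sketched as a small data structure. The following Python example is illustrative only and does not reflect any specific PAM product's API: JitGrant, elevate, the 30-minute default, and the example scope string are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class JitGrant:
    identity: str
    scope: str            # narrowest object the task needs, e.g. one schema
    expires_at: datetime  # automatic expiry, no manual revocation required
    reason: str           # recorded so the elevation can be reviewed later

    def allows(self, identity: str, scope: str, now: Optional[datetime] = None) -> bool:
        """True only for the named identity, the exact scope, and before expiry."""
        if now is None:
            now = datetime.now(timezone.utc)
        return identity == self.identity and scope == self.scope and now < self.expires_at


def elevate(identity: str, scope: str, reason: str,
            ttl_minutes: int = 30, now: Optional[datetime] = None) -> JitGrant:
    """Issue a temporary grant that expires ttl_minutes after it is created."""
    if now is None:
        now = datetime.now(timezone.utc)
    return JitGrant(identity, scope, now + timedelta(minutes=ttl_minutes), reason)


if __name__ == "__main__":
    grant = elevate("alice", "feature_db.schema_a", reason="hypothetical change ticket")
    print(grant.allows("alice", "feature_db.schema_a"))
    print(grant.allows("alice", "feature_db.schema_b"))
```

Because the check compares scope exactly and tests the expiry on every call, the identity holds no usable rights outside the task window, which is the property the section describes.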

Threat narrative

Attacker objective: There was no attacker objective in this incident because the event was an internal failure, but the operational outcome was global service disruption.

Entry occurred through a legitimate internal permissions change to a Cloudflare database system, not through external compromise.
Privilege escalation was operational, not adversarial. The changed workflow allowed a recurring query to generate a larger feature file than intended, expanding the effect of a normal administrative action.
Impact followed when the oversized file propagated across the network and systems that loaded it began failing, creating a widespread outage.


NHI Mgmt Group analysis

Standing privileges are not only a breach risk; they are also a reliability risk. Cloudflare’s outage shows that a legitimate permissions change can still produce global failure when persistent access reaches too far into shared production systems. The industry habit of treating IAM as a security-only discipline misses the operational blast radius created by overbroad rights. Practitioners need to treat privilege design as an availability control, not just an access control.

Zero Standing Privileges matters because it changes the failure mode of infrastructure changes. A permanent entitlement can survive long enough to turn a single action into a network-wide outage, especially when the identity is tied to databases, feature generation, or automation. The problem is not simply that access exists. The problem is that access persists across contexts that should have been isolated. The practitioner takeaway is that persistence itself is part of the risk surface.

Standing privilege without task scoping is a governance assumption that no longer holds in modern cloud operations. That assumption was designed for environments where operators changed systems in narrow, manual ways. It fails when one identity can influence a shared artifact consumed across production, because the action is no longer local to the operator. The implication is that IAM, PAM, and SRE teams must rethink how much system-wide effect any one identity is allowed to have.

Non-human identities deserve the same privilege scrutiny as human admins. Service accounts, automation pipelines, and API keys often inherit the broadest access because they are easiest to operationalise, not because they are easiest to justify. That creates an identity governance blind spot where machine access is less visible but more durable. Practitioners should assume that hidden persistence in NHI access can become the most fragile part of the environment.

Identity blast radius is the better lens for this class of incident. The question is not whether an identity can access a resource, but how far a valid action can propagate once exercised. Cloudflare’s outage shows that the boundary between access and impact is often much thinner than standard entitlement reviews assume. Teams should re-evaluate whether their access model measures permission alone or the downstream effect of permission.

From our research:
70% of organisations grant AI systems more access than they would give a human employee performing the exact same job, according to The 2026 Infrastructure Identity Survey.
Incident rates were only 17% in least-privileged AI environments, compared with 76% in over-privileged systems, a roughly 4.5x difference in security outcomes, according to the same survey.
For practitioners reviewing access models, see Ultimate Guide to NHIs, Key Challenges and Risks for the broader privilege and sprawl context.

What this signals

Identity blast radius should become a standard governance metric for cloud programmes. When a single permissions change can propagate into global service impact, the access review question is no longer only who can do what, but how far the effect of that action can travel. Teams that still measure permissions without measuring downstream effect will miss the risk Cloudflare exposed.

The same pattern applies across human admins, service accounts, and automation identities. A privilege model that depends on permanent access to shared systems creates a hidden reliability tax, because every change carries an outsized failure path. For readers' own programmes, the next step is to align PAM, IGA, and SRE change controls around temporary access and scoped authority.

With 70% of organisations already granting AI systems more access than they would give a human employee doing the same job, the governance gap is broader than one outage. It points to a market-wide tolerance for overreach that will keep surfacing as operational fragility unless teams redesign access for actual blast radius.

For practitioners

Reclassify sensitive write access by blast radius: Identify database, feature-generation, and control-plane permissions that can affect shared production services and move them into higher-risk review paths. Focus on any identity whose actions can propagate beyond its immediate task boundary.
Replace standing administrative access with JIT elevation: Grant temporary access only for the duration of a specific change, and require automatic expiry once the task is complete. Keep the elevation scoped to the narrowest object, schema, or configuration set that the work requires.
Separate operational convenience from system-wide authority: Review service accounts, automation pipelines, and API keys that can alter shared artefacts or production inputs. If the identity exists for routine operations, it should not also retain broad modification rights without a documented justification.
Test failure propagation before approving access paths: Simulate how a single permitted change could affect downstream systems, especially where generated files, cached artefacts, or replicated configuration are involved. Approve only the access paths whose blast radius is understood and acceptable.
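The last recommendation above can be approximated with a dependency walk: given a map of which systems consume which artefacts, compute everything reachable from a proposed change and compare it with an agreed limit. This is a simplified Python sketch; the downstream_of and approve_access_path functions, the max_reach threshold, and the example graph are hypothetical and do not describe any real architecture.

```python
from collections import deque
from typing import Dict, List, Set


def downstream_of(artifact: str, consumers: Dict[str, List[str]]) -> Set[str]:
    """Return every system reachable from a changed artifact (breadth-first)."""
    reached: Set[str] = set()
    queue = deque([artifact])
    while queue:
        current = queue.popleft()
        for nxt in consumers.get(current, []):
            # Skip already-visited nodes so cycles in the graph terminate.
            if nxt not in reached and nxt != artifact:
                reached.add(nxt)
                queue.append(nxt)
    return reached


def approve_access_path(artifact: str, consumers: Dict[str, List[str]], max_reach: int) -> bool:
    """Approve write access only if the change cannot reach more than max_reach systems."""
    return len(downstream_of(artifact, consumers)) <= max_reach


if __name__ == "__main__":
    # Hypothetical dependency map: key is consumed by each system in the list.
    graph = {
        "feature_file": ["proxy"],
        "proxy": ["cdn", "auth"],
        "auth": ["proxy"],
    }
    print(sorted(downstream_of("feature_file", graph)))
    print(approve_access_path("feature_file", graph, max_reach=2))
```

A real review would also weigh how critical each downstream system is, but even a simple reachability count makes the blast radius of an access path explicit before it is approved.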

Key takeaways

Cloudflare’s outage shows that standing privileges can create availability failures even when no attacker is involved.
The incident demonstrates that the real risk is blast radius, because one permissions change can cascade into production-wide impact.
Zero Standing Privileges and tightly scoped JIT elevation are the practical controls that reduce both breach exposure and operational fragility.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

The OWASP Non-Human Identity Top 10 addresses the attack and risk surface, while NIST CSF 2.0 and NIST Zero Trust (SP 800-207) set the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
OWASP Non-Human Identity Top 10	NHI5:2025 (Overprivileged NHI)	Persistent privilege and overbroad access sit at the centre of this outage pattern.
NIST CSF 2.0	PR.AA-05	Access permissions directly influenced production impact and operational resilience.
NIST Zero Trust (SP 800-207)	Section 2.1 (Tenets of Zero Trust)	Zero Trust access discipline supports scoped, continuous verification for sensitive changes.

Limit high-risk permissions and validate that access scope matches the change being performed.

Key terms

Standing Privilege: Standing privilege is always-on access that remains available to an identity even when no task is actively in progress. In cloud and identity programmes, it creates operational and security risk because the same entitlement can be used across multiple contexts, increasing the chance of unintended impact.
Zero Standing Privileges: Zero Standing Privileges is an access model where no identity keeps permanent elevated rights. Access is granted only when needed, scoped to the task, and removed automatically after use. The model reduces both breach exposure and accidental blast radius in production environments.
Blast Radius: Blast radius is the amount of damage or disruption that can follow from a single identity action or permission. In identity governance, it is a practical measure of how far access can propagate through shared systems, making it more useful than entitlement counts alone.
Just-in-Time Elevation: Just-in-Time elevation is the temporary granting of extra access for a specific task, usually with an automatic expiry. It is used to reduce standing privilege while still allowing legitimate operational work, especially in environments where shared infrastructure changes can have broad effects.

Deepen your knowledge

NHI governance, agentic AI identity, and machine identity lifecycle are core topics in our NHI Foundation Level course, the industry's only accredited NHI security programme. If you are responsible for identity security strategy or NHI governance in your organisation, it is worth exploring.

This post draws on content published by Apono: When the Internet Blinks: What Cloudflare’s Outage Teaches Us About Standing Privileges. Read the original.

NHIMG Editorial Note
Published by the NHIMG editorial team on 2025-11-20.
NHI Mgmt Group — the independent authority on Non-Human Identity, IAM, and Agentic AI security. nhimg.org