
Why do backups still fail during cloud outages even when the data is intact?

By NHI Mgmt Group Editorial Team | Updated June 10, 2026 | Domain: Architecture & Implementation Patterns

Because the backup may be correct while the infrastructure around it is not. Recovery depends on permissions, network configuration, service relationships, and current cloud state. If those elements drifted or were never captured, the team must reconstruct them during an outage, which extends downtime and increases error risk.

Why This Matters for Security Teams

Backups fail during cloud outages when recovery depends on more than the stored data. The backup object can be intact while identity, networking, DNS, KMS access, and service dependencies are unavailable or no longer match the production state. That is why recovery planning has to cover the control plane as well as the data plane, especially in environments with frequent change and shared responsibility. NIST’s Cybersecurity Framework 2.0 treats recovery as an operational capability, not a storage feature. NHIMG has documented how cloud identity and access drift create real blast-radius problems in incidents such as the Snowflake breach, where access paths and governance mattered as much as data exposure. In practice, many security teams discover backup weakness only after an outage exposes missing permissions, broken dependencies, or untested restore paths, rather than through intentional recovery testing.

How It Works in Practice

A reliable backup strategy must answer two questions: can the data be restored, and can the environment needed to use that data be recreated? The second question is where many cloud recovery plans break down. Cloud outages often disable the management plane, interfere with identity providers, or disrupt the services that backups depend on for key retrieval, network routing, and application startup. If those dependencies were never captured as code or validated as part of the recovery process, the organisation is forced to rebuild them manually under pressure.

Practitioners should treat recovery as a chain of prerequisites:
  • Identity access for break-glass roles, backup operators, and service accounts
  • Key management availability for decrypting backup sets and rehydrated workloads
  • Network configuration, including routing, security groups, and private endpoints
  • Application dependencies such as databases, queues, and configuration stores
  • Restore sequencing, because one system may not boot until another is present
That is why backup validation has to include full restore drills, not just checksum verification. Cloud-native controls also need to be versioned alongside infrastructure; otherwise the restored data lands in an environment with stale policies and broken trust boundaries. The 230M AWS environment compromise and the Codefinger AWS S3 ransomware attack, both covered by NHIMG, underscore a recurring lesson: storage durability does not guarantee operational recoverability. These controls tend to break down when the outage affects the identity or control plane, because restore access, encryption access, and orchestration paths all fail together.
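The restore-sequencing point in the prerequisite chain above can be sketched as a dependency graph. The following is a minimal Python illustration, not a definitive implementation: the component names and the `RESTORE_DEPENDENCIES` map are hypothetical, and a real environment would derive them from its own infrastructure-as-code or service inventory. It uses the standard library's `graphlib.TopologicalSorter` to produce an order in which every prerequisite is restored before the systems that depend on it.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical dependency map (an assumption for illustration only):
# each component maps to the set of components that must be restored first.
RESTORE_DEPENDENCIES = {
    "break_glass_identity": set(),
    "key_management": {"break_glass_identity"},
    "network": {"break_glass_identity"},
    "database": {"key_management", "network"},
    "message_queue": {"network"},
    "config_store": {"key_management", "network"},
    "application": {"database", "message_queue", "config_store"},
}


def restore_order(dependencies):
    """Return a restore sequence in which prerequisites precede dependents.

    Raises ValueError if the map contains a circular dependency, which
    would make the documented restore plan impossible to execute.
    """
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as exc:
        # The detected cycle is the second element of the exception args.
        raise ValueError(f"circular restore dependency: {exc.args[1]}") from exc


if __name__ == "__main__":
    for step, component in enumerate(restore_order(RESTORE_DEPENDENCIES), start=1):
        print(f"{step}. restore {component}")
```

Keeping a map like this under version control alongside the infrastructure gives a restore drill something concrete to validate: if the drill needs a step the graph does not contain, the plan has drifted.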

Common Variations and Edge Cases

Tighter backup controls often increase operational overhead, requiring organisations to balance faster recovery against the complexity of maintaining duplicate access paths, keys, and environments. There is no universal standard for this yet, but current guidance suggests that the most resilient design separates backup immutability from restore independence.

Some common edge cases change the answer:
  • Encrypted backups are useless if the key service is region-locked or tied to the same outage domain.
  • Immutable object storage still fails recovery if IAM policies, SCPs, or federation are unavailable.
  • Cross-region copies help with data durability, but they do not solve application dependency drift.
  • Managed backup services reduce operational burden, but they can create a single recovery dependency if the provider control plane is impaired.
For cloud and NHI-heavy environments, the practical fix is to test restores using separate recovery identities, independent access controls, and documented manual fallbacks. NHIMG’s Ultimate Guide to NHIs is a useful reference for understanding why service identities and permissions must be recoverable alongside the workload. The core issue is not whether the backup exists, but whether the organisation can still authenticate, authorize, decrypt, and reassemble the environment when the cloud platform itself is degraded.
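One way to reason about restore independence is to simulate the loss of each outage domain and ask whether a restore could still authenticate, decrypt, and read a backup copy. The sketch below is a simplified model under stated assumptions: the `RecoveryDependency` type, the `kind` labels, the `REQUIRED_KINDS` set, and the example inventory are all hypothetical, and a real assessment would also need to cover networking and application dependencies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryDependency:
    name: str
    kind: str            # hypothetical labels: "backup_copy", "kms_key", "recovery_identity"
    outage_domain: str   # hypothetical label, e.g. a region or an identity provider


# Assumption: a restore needs at least one surviving dependency of each kind.
REQUIRED_KINDS = frozenset({"backup_copy", "kms_key", "recovery_identity"})


def domains_that_block_restore(dependencies, required_kinds=REQUIRED_KINDS):
    """Map each outage domain whose failure blocks a restore to the kinds it removes."""
    blocking = {}
    for failed in sorted({d.outage_domain for d in dependencies}):
        # Kinds still available when everything in the failed domain is gone.
        surviving = {d.kind for d in dependencies if d.outage_domain != failed}
        missing = required_kinds - surviving
        if missing:
            blocking[failed] = sorted(missing)
    return blocking


# Illustrative inventory: the backup is replicated, but the key is not.
INVENTORY = [
    RecoveryDependency("backup-primary", "backup_copy", "region-a"),
    RecoveryDependency("backup-replica", "backup_copy", "region-b"),
    RecoveryDependency("backup-key", "kms_key", "region-a"),
    RecoveryDependency("federated-operator", "recovery_identity", "idp-primary"),
    RecoveryDependency("break-glass-role", "recovery_identity", "region-b"),
]

if __name__ == "__main__":
    print(domains_that_block_restore(INVENTORY))
    # → {'region-a': ['kms_key']}
```

In this example the cross-region copy survives a region-a outage, yet the restore still fails because the only decryption key lives in the failed domain, which is the first edge case listed above.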

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

The OWASP Non-Human Identity Top 10 addresses the attack and risk surface, while NIST CSF 2.0 and NIST AI RMF set the governance and control requirements practitioners need to meet.

Framework | Control / Reference | Relevance
NIST CSF 2.0 | RC.RP-01 | Recovery planning is directly about restoring services after a cloud outage.
OWASP Non-Human Identity Top 10 | NHI-03 | Backup access often fails when non-human identities and secrets are not recoverable.
NIST AI RMF | (not specified) | AI RMF helps govern automated recovery and change decisions in complex cloud environments.

Track NHI credentials, rotation, and break-glass access as part of backup recovery design.

NHIMG Editorial Note
Reviewed and updated by the NHIMG editorial team on June 10, 2026.