Cloud disaster recovery failed when configuration did

By NHI Mgmt Group Editorial Team | Published 2025-10-20 | Domain: Best Practices | Source: ControlMonkey

TL;DR: AWS’s recent outage triggered more than 6.5 million disruption reports worldwide and exposed a harder truth for cloud teams: disaster recovery fails when configuration, dependencies, and drift are not recoverable, according to ControlMonkey and CNN. Data backups alone do not restore operational identity, policy state, or infrastructure topology.

At a glance

What this is: An analysis of why cloud disaster recovery breaks when teams can recover data but not configuration, dependencies, and infrastructure state.

Why it matters: IAM, NHI, and cloud platform teams need recovery plans that restore access paths, policy boundaries, and dependent systems, not just storage.

Context

Cloud disaster recovery is the ability to restore not just data, but the configuration and dependencies that make services run. The article argues that many recovery plans fail because they treat backups as the finish line, while live cloud operations depend on infrastructure state, region placement, policies, and hidden service dependencies.

For IAM and cloud security teams, that gap matters because recovery is an identity problem as much as a resilience problem. If access controls, infrastructure-as-code coverage, and dependency inventories are incomplete, the environment may come back inconsistent, partially trusted, or slower to recover than the business expects.

Key questions

Q: What breaks when cloud disaster recovery only restores data?

A: Recovery breaks when teams cannot reconstruct the configuration, permissions, and dependencies needed for workloads to run. A backup may restore files, but it does not automatically restore network paths, identity policy, or service topology. That leaves the environment partially restored and slower to recover.

Q: Why do cloud outages expose weaknesses in IAM and configuration management?

A: Because access boundaries and infrastructure state are part of what makes the platform operational. If IAM policies, region dependencies, or infrastructure definitions are incomplete, failover can succeed technically while the service still cannot operate as intended.

Q: How do teams know whether disaster recovery is actually working?

A: They test whether a critical service can be rebuilt end to end from code and snapshots, with permissions intact and dependencies available. If the service only works after manual intervention, the recovery design is not reliable enough.

Q: How should security teams prioritize recovery improvements after a cloud outage?

A: Start with the systems most concentrated in one region or one manual process, then close infrastructure-as-code gaps, automate drift detection, and validate that identity state can be restored with the workload. Resilience depends on the whole operating context.
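
The prioritisation order in the last answer can be sketched as a simple scoring pass. This is a minimal illustration, not a method from the ControlMonkey article: the `Workload` fields, the weights, and the example workloads are all hypothetical, and a real programme would calibrate them against its own risk model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical summary of one workload's recovery posture."""
    name: str
    regions: int               # number of regions the workload can run in
    iac_coverage: float        # fraction of resources defined in code, 0.0 to 1.0
    manual_steps: int          # manual steps in the current recovery runbook
    identity_restorable: bool  # True if IAM/policy state is captured with the workload


def recovery_risk(w: Workload) -> float:
    """Higher score means the workload should be fixed sooner.

    The weights are illustrative assumptions only.
    """
    score = 0.0
    if w.regions <= 1:
        score += 3.0                       # single-region concentration
    score += 2.0 * (1.0 - w.iac_coverage)  # infrastructure-as-code gap
    score += 0.5 * min(w.manual_steps, 4)  # manual process, capped at 2.0
    if not w.identity_restorable:
        score += 2.0                       # identity state not recoverable
    return score


def prioritise(workloads: list[Workload]) -> list[str]:
    """Return workload names ordered from highest to lowest recovery risk."""
    ranked = sorted(workloads, key=recovery_risk, reverse=True)
    return [w.name for w in ranked]


workloads = [
    Workload("billing", regions=1, iac_coverage=0.4, manual_steps=6, identity_restorable=False),
    Workload("search", regions=3, iac_coverage=1.0, manual_steps=0, identity_restorable=True),
    Workload("auth", regions=2, iac_coverage=0.8, manual_steps=1, identity_restorable=True),
]
print(prioritise(workloads))
```

The single-region check carries the largest weight because the answer above names regional concentration first; the other terms follow the same order.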

Technical breakdown

Why configuration recovery matters more than data backup

Backup systems restore content, but they do not automatically restore the operational rules that make cloud services function. In cloud environments, configuration includes IAM policies, network dependencies, service endpoints, region mappings, and provisioning logic. If those elements are not captured and replayable, a restored workload can still fail because it lacks the permissions, topology, or integration paths it needs. Disaster recovery therefore has to preserve state beyond storage. The article’s core point is that resilience depends on rebuilding the whole service fabric, not merely retrieving files.

Practical implication: treat configuration, policy, and dependency recovery as first-class recovery objectives, not supporting tasks.
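
One way to make that objective concrete is to model the "recovery unit" explicitly and check what has not been captured. The sketch below is an assumption-laden illustration: the `RecoveryUnit` fields and the example identifiers are hypothetical and do not come from the article or from any cloud provider's API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RecoveryUnit:
    """Everything assumed necessary to bring one service back (hypothetical model)."""
    service: str
    data_snapshot: Optional[str] = None       # e.g. a snapshot identifier
    iam_policies: list = field(default_factory=list)
    network_config: dict = field(default_factory=dict)
    dependencies: list = field(default_factory=list)
    provisioning_code: Optional[str] = None   # e.g. path to an IaC module


def missing_components(unit: RecoveryUnit) -> list:
    """List the parts of the recovery unit that have not been captured.

    An empty list or dict is treated as "not captured", so a service with
    genuinely no dependencies would need an explicit marker in a real system.
    """
    checks = {
        "data_snapshot": unit.data_snapshot,
        "iam_policies": unit.iam_policies,
        "network_config": unit.network_config,
        "dependencies": unit.dependencies,
        "provisioning_code": unit.provisioning_code,
    }
    return [name for name, value in checks.items() if not value]


# A unit with only a data backup: the case the section warns about.
backup_only = RecoveryUnit(service="orders", data_snapshot="snap-example")
print(missing_components(backup_only))
```

A unit that reports anything other than an empty list is a backup, not yet a recovery plan.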

Infrastructure as code closes the recovery gap

Infrastructure as code turns cloud state into something that can be versioned, audited, and rebuilt predictably. Without it, teams fall back to console changes and ad hoc fixes, which create drift between what exists and what is documented. That drift becomes a recovery failure mode because manual systems are hard to reproduce under pressure. IaC does not eliminate outage impact, but it makes redeployment deterministic and easier to validate. In a multi-account or multi-region environment, that determinism is the difference between controlled restoration and improvised repair.

Practical implication: identify console-managed resources and bring them under code before the next outage exposes them.
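
Finding console-managed resources reduces to a set comparison between what is running and what the code declares. The following sketch assumes both inventories have already been exported as sets of resource identifiers; how that export is done (provider inventory tooling, IaC state files) is outside the sketch, and the identifiers shown are made up.

```python
def find_unmanaged(live_ids: set, iac_ids: set) -> dict:
    """Compare live resource identifiers with those tracked in code."""
    return {
        "unmanaged": sorted(live_ids - iac_ids),  # exist in the cloud, not in code
        "missing": sorted(iac_ids - live_ids),    # declared in code, not found live
    }


def iac_coverage(live_ids: set, iac_ids: set) -> float:
    """Fraction of live resources that are defined in code."""
    if not live_ids:
        return 1.0  # nothing is running, so nothing is unmanaged
    return len(live_ids & iac_ids) / len(live_ids)


# Hypothetical inventories for illustration.
live = {"vpc-main", "sg-web", "bucket-logs", "role-deploy"}
in_code = {"vpc-main", "sg-web", "queue-jobs"}

report = find_unmanaged(live, in_code)
print(report["unmanaged"], report["missing"], iac_coverage(live, in_code))
```

The "unmanaged" list is the migration backlog; the "missing" list is worth reviewing too, since it can indicate resources deleted by hand.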

Drift detection and dependency mapping make failover believable

Drift happens when live infrastructure no longer matches the declared configuration. In recovery, that mismatch can break redeployment, expose security gaps, or send failover into an environment that looks correct on paper but is operationally incomplete. Dependency mapping matters for the same reason: services rarely fail alone. A regional outage can expose implicit reliance on a single region, shared control plane, or third-party service. The technical lesson is that recovery quality depends on knowing what each workload needs to function and validating that the restored environment matches that need.

Practical implication: automate drift detection and maintain dependency maps for every critical service path.
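
Both checks can be illustrated with plain dictionaries. The sketch below assumes declared and live configuration are available as nested dicts with string keys and that each dependency is mapped to the set of regions it is available in; the resource names and values are hypothetical, and a production drift tool would work from real provider state.

```python
def detect_drift(declared: dict, live: dict, path: str = "") -> list:
    """Recursively compare declared and live configuration, returning drift findings."""
    drift = []
    for key in sorted(set(declared) | set(live)):
        here = f"{path}.{key}" if path else str(key)
        if key not in live:
            drift.append(f"{here}: missing in live")
        elif key not in declared:
            drift.append(f"{here}: not declared")
        elif isinstance(declared[key], dict) and isinstance(live[key], dict):
            drift.extend(detect_drift(declared[key], live[key], here))
        elif declared[key] != live[key]:
            drift.append(f"{here}: declared={declared[key]!r} live={live[key]!r}")
    return drift


def single_region_dependencies(dep_regions: dict) -> list:
    """Return dependencies that are available in exactly one region."""
    return sorted(name for name, regions in dep_regions.items() if len(regions) == 1)


# Hypothetical declared vs live state.
declared = {
    "sg-web": {"ingress_port": 443, "cidr": "10.0.0.0/16"},
    "role-app": {"policy": "read-only"},
}
live = {
    "sg-web": {"ingress_port": 443, "cidr": "0.0.0.0/0"},
    "role-app": {"policy": "read-only"},
    "role-temp": {"policy": "admin"},
}
findings = detect_drift(declared, live)

# Hypothetical dependency map: dependency name -> regions it runs in.
deps = {
    "auth-service": {"us-east-1"},
    "object-store": {"us-east-1", "eu-west-1"},
    "payments-api": {"us-east-1"},
}
concentrated = single_region_dependencies(deps)
print(findings, concentrated)
```

In this example the drift findings are also security findings: an undeclared admin role and a widened network range would both be restored into, or silently lost from, a failover environment.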

NHI Mgmt Group analysis

Configuration recoverability is the real cloud resilience test: disaster recovery that only restores data assumes the environment itself is disposable. That assumption breaks in cloud because identity policies, infrastructure state, and service dependencies determine whether workloads can actually run after an outage. The implication is that recovery planning must be built around reconstructing the full operating context, not preserving files alone.

Cloud drift is an identity governance problem, not just an infrastructure problem: when the live environment diverges from code, control boundaries become unreliable and recovery behaviour becomes unpredictable. Access, policy, and topology drift can leave teams restoring into an environment that no longer matches the approved design. Practitioners should treat drift as a governance failure that degrades both resilience and trust.

Identity blast radius: resilience collapses when a single region, platform dependency, or shared configuration path becomes the hidden point of failure. The article shows how easily business services can be concentrated in one unavailable control plane or region without teams realising it. The implication is that recovery architecture must be designed around dependency dispersion, not optimistic assumptions about portability.

Infrastructure-as-code maturity now defines recovery credibility: organisations that still depend on manual console changes cannot prove that they can rebuild services consistently under pressure. That is true for cloud operations, and it is just as true for the identity controls that protect them. The practitioner conclusion is that recovery testing must include code coverage, rollback fidelity, and dependency completeness, not only storage restoration.

Identity and access state must be recoverable alongside workloads: cloud outage response fails when teams can restart compute but cannot re-establish the same access boundaries, policy inheritance, and service permissions. That gap turns an outage into a governance event because the environment may come back with different effective access than before. The implication is that resilience planning must include identity state as part of the recovery unit.

From our research:
Only 44% of organisations have implemented any policies to manage their AI agents, despite 92% agreeing that governing AI agents is critical to enterprise security, according to The 2026 Infrastructure Identity Survey.
69% of security leaders agree identity management must fundamentally shift to address agentic AI systems, according to The 2026 Infrastructure Identity Survey.
For adjacent reading, see the NHI Lifecycle Management Guide for the provisioning, rotation, and offboarding controls that make recovery state governable.

What this signals

Configuration recoverability has become a governance requirement, not an ops preference: cloud resilience now depends on whether teams can reproduce the entire operating state, including identity boundaries, policy inheritance, and dependency mappings. That is where manual cloud management breaks first, and it is why recovery planning should sit alongside IAM and platform governance. With 67% of organisations still relying heavily on static credentials despite the risks they pose to agentic AI deployments, per The 2026 Infrastructure Identity Survey, the same pattern of brittle state is already visible across identity programmes.

Identity blast radius: this is the point at which one missing region, one untracked dependency, or one unmanaged access path can turn a recoverable outage into a prolonged business disruption. Teams should use the outage as a prompt to inventory hidden dependencies, validate recovery runbooks, and separate true redundancy from assumed resilience.

The next maturity step is to make recovery evidence-based. If a workload cannot be rebuilt with the same controls, same dependencies, and same access posture, the programme has not recovered the service, only the data.

For practitioners

Baseline every critical dependency: Map services, regions, shared control planes, and third-party dependencies for each critical workload so you know exactly what must be restored together.
Pull console-managed resources into code: Identify ClickOps-created or legacy resources and migrate them under Terraform or equivalent infrastructure as code so recovery is reproducible and auditable.
Automate drift detection and remediation: Compare live cloud state against declared configuration continuously so recovery does not fail because production no longer matches the runbook.
Test failover with a real workload slice: Run a small recovery drill on one critical service, including permissions, network paths, and dependencies, to measure whether restoration actually works.
Snapshot policy and access state daily: Capture infrastructure configuration, policy inheritance, and access boundaries so a rollback restores operational control, not just data.
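
The snapshot and drill steps above can be checked mechanically: fingerprint each daily snapshot, and after a recovery drill compare effective access before and after. This is a minimal sketch under stated assumptions: snapshots are JSON-serialisable dicts, access is modelled as a hypothetical mapping from principal to permission strings, and none of the names correspond to a real provider API.

```python
import hashlib
import json


def fingerprint(state: dict) -> str:
    """Stable SHA-256 digest of a JSON-serialisable state snapshot.

    Dict key order does not affect the digest; list order does, so lists
    should be sorted before fingerprinting if their order is not meaningful.
    """
    canonical = json.dumps(state, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def access_changes(before: dict, after: dict) -> dict:
    """Per principal, list permissions gained or lost between two snapshots."""
    changes = {}
    for principal in sorted(set(before) | set(after)):
        old = set(before.get(principal, []))
        new = set(after.get(principal, []))
        if old != new:
            changes[principal] = {
                "gained": sorted(new - old),
                "lost": sorted(old - new),
            }
    return changes


# Hypothetical access state before an outage and after a recovery drill.
before = {"svc-orders": ["db:read", "queue:send"], "svc-report": ["db:read"]}
after = {"svc-orders": ["db:read", "db:write", "queue:send"], "svc-report": ["db:read"]}

print(fingerprint(before) == fingerprint(after))
print(access_changes(before, after))
```

A drill passes the identity check only when `access_changes` returns an empty dict; any gained permission means the service came back with broader access than it had before.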

Key takeaways

Cloud disaster recovery fails when teams can restore storage but not the configuration and dependencies that make services operational.
The outage exposes a broader resilience gap: unmanaged drift, hidden dependencies, and manual fixes make recovery unpredictable.
Practitioners need recovery plans that restore identity state, code-defined infrastructure, and verified failover paths together.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

NIST CSF 2.0 and NIST Zero Trust (SP 800-207) set the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
NIST CSF 2.0	PR.AA-05 (PR.AC-4 in CSF 1.1)	Recovery depends on restoring access boundaries and service permissions.
NIST Zero Trust (SP 800-207)	SC-7 (boundary protection control from NIST SP 800-53)	Regional failover and dependency dispersion support zero trust segmentation.
NIST CSF 2.0	RC.RP-01 (RC.RP-1 in CSF 1.1)	Recovery planning must cover the full operating state, not only data.

Validate recovery designs against SC-7 so restored services do not rely on a single hidden trust path.

Key terms

Cloud Disaster Recovery: Cloud disaster recovery is the discipline of restoring cloud services after an outage by rebuilding data, configuration, dependencies, and access controls together. It goes beyond backup and restore, because a service is not recovered until it behaves correctly in the target environment.
Infrastructure as Code: Infrastructure as code is the practice of defining cloud resources in version-controlled code so they can be deployed, reviewed, and recreated consistently. For resilience, it gives teams a repeatable way to rebuild environments and verify that live state matches approved configuration.
Configuration Drift: Configuration drift is the mismatch between what is running in production and what is defined in code or documentation. In cloud operations, drift weakens recovery because the restored environment may differ from the one the team planned, tested, and approved.
Identity Blast Radius: Identity blast radius is the scope of damage that grows when access, permissions, or dependencies are concentrated in one region, one account, or one control path. In recovery planning, it shows how a single hidden dependency can enlarge the impact of an outage.

Deepen your knowledge

NHI governance, agentic AI identity, and machine identity lifecycle are core topics in our NHI Foundation Level course, the industry's only accredited NHI security programme. If you are responsible for identity security strategy or NHI governance in your organisation, it is worth exploring.

This post draws on content published by ControlMonkey: analysis of cloud disaster recovery after the AWS outage. Read the original.

NHIMG Editorial Note
Published by the NHIMG editorial team on 2025-10-20.
NHI Mgmt Group — the independent authority on Non-Human Identity, IAM, and Agentic AI security. nhimg.org