TL;DR: Cloud backup failures often stem from broken recovery assumptions rather than missing data. According to ControlMonkey, teams can restore files yet still fail to rebuild permissions, dependencies, and infrastructure state. The real control problem is validating full system recovery, not treating backup storage as proof of disaster recovery readiness.
At a glance
What this is: An analysis of why cloud backup programmes fail at recovery even when backups exist. The key finding is that infrastructure state, drift, and dependency gaps are the real blockers.
Why it matters: IAM, NHI, and platform teams all depend on recoverable permissions and relationships, not just preserved data, when they need to restore services under pressure.
By the numbers:
- Downtime for Fortune 100 companies can cost between $500,000 and $1 million per day.
👉 Read ControlMonkey's analysis of cloud backup mistakes and recovery gaps
Context
Cloud backup is not the same thing as disaster recovery. Backup protects data, but recovery depends on whether the infrastructure, permissions, and dependencies around that data can be recreated in a working state. For identity teams, that means recovery is as much about access paths and environment state as it is about storage.
Cloud environments make that distinction harder to ignore because state changes continuously. IaC, manual console fixes, unmanaged drift, and changing relationships between services mean the system you think you can restore is often not the system that actually exists. This is why cloud backup mistakes become identity and recovery problems, not just storage problems.
Key questions
Q: How should security teams test whether cloud recovery actually works?
A: They should run full recovery exercises that rebuild the environment, not just restore data. The test should confirm IAM permissions, network paths, dependencies, and application behavior all work together under outage conditions. If a restored system cannot run as intended, the backup programme has only proven retention, not recovery.
Q: Why do backups still fail during cloud outages even when the data is intact?
A: Because the backup may be correct while the infrastructure around it is not. Recovery depends on permissions, network configuration, service relationships, and current cloud state. If those elements drifted or were never captured, the team must reconstruct them during an outage, which extends downtime and increases error risk.
Q: What breaks when infrastructure drift is not tracked continuously?
A: Recovery breaks first, because teams no longer know which configuration is authoritative. Drift creates uncertainty about permissions, dependencies, and the correct runtime state, so restoration turns into guesswork. That makes even good backups harder to use and can prevent systems from returning to service on time.
Q: Who is accountable when cloud backup fails to support recovery?
A: Accountability sits with the teams that own infrastructure state, identity controls, and recovery testing, not only with backup operators. Frameworks such as the NIST Cybersecurity Framework 2.0 expect resilience to include recovery, so the programme owner must verify that backups, access, and rebuild paths all work together.
Technical breakdown
Why full recovery is different from data restore
A data restore proves that bytes can be recovered. Full recovery proves that the service can run again with the right permissions, networking, dependencies, and runtime assumptions intact. In cloud environments, those layers are often external to the backup itself. That is why a backup can be healthy while the real environment remains unrecoverable. The failure mode is not missing data. It is missing operational context. Practical implication: test the rebuilt service, not just the restored dataset.
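One way to make this concrete is to gate the drill verdict on every layer named above rather than on the restore job alone. The following is a minimal Python sketch under stated assumptions, not a definitive implementation: `LayerResult`, `assess_recovery`, and the layer identifiers are hypothetical, and the results would come from whatever checks a team already runs.

```python
from dataclasses import dataclass

# Layers taken from the paragraph above; the identifiers are illustrative.
RECOVERY_LAYERS = ("data", "permissions", "networking", "dependencies", "runtime")


@dataclass(frozen=True)
class LayerResult:
    layer: str
    passed: bool
    detail: str = ""


def assess_recovery(results):
    """Return (recovered, gaps); a layer with no result counts as a gap."""
    by_layer = {r.layer: r for r in results}
    gaps = [
        layer
        for layer in RECOVERY_LAYERS
        if layer not in by_layer or not by_layer[layer].passed
    ]
    return (not gaps, gaps)


# Hypothetical drill: the data came back, but the service role did not,
# and nobody checked dependencies or runtime behavior.
drill = [
    LayerResult("data", True, "snapshot restored"),
    LayerResult("permissions", False, "service role missing"),
    LayerResult("networking", True),
]
recovered, gaps = assess_recovery(drill)
print(recovered, gaps)
```

Counting an untested layer as a gap is a deliberate choice in this sketch: a healthy restore job with no permission or runtime check should not read as a passed drill.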
How drift breaks infrastructure recovery
Drift appears when live cloud state diverges from Terraform or other source definitions. Emergency fixes, ClickOps changes, temporary overrides, and AI-generated modifications can all create a gap between declared and actual state. During recovery, that gap forces teams to reconcile instead of restore, which slows restoration and introduces uncertainty about the correct configuration. Practical implication: compare declared infrastructure with actual cloud state continuously, because recovery depends on the real version of the environment.
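The comparison between declared and actual state can be sketched as a plain diff. This example assumes both sides have already been exported into dictionaries keyed by a resource identifier; the function name `diff_state`, the resource names, and the attributes are hypothetical and do not reflect any particular tool's output format.

```python
def diff_state(declared, actual):
    """Compare declared and live resources keyed by a resource identifier."""
    changed = {}
    for rid in sorted(set(declared) & set(actual)):
        want, live = declared[rid], actual[rid]
        delta = {
            key: (want.get(key), live.get(key))
            for key in sorted(set(want) | set(live))
            if want.get(key) != live.get(key)
        }
        if delta:
            changed[rid] = delta
    return {
        "missing": sorted(set(declared) - set(actual)),    # declared, not live
        "unmanaged": sorted(set(actual) - set(declared)),  # live, not declared
        "changed": changed,                                # (declared, live) pairs
    }


# Hypothetical data: a console hotfix widened a rule and added a role.
declared = {
    "sg-web": {"ingress": "10.0.0.0/16", "port": 443},
    "role-app": {"policy": "read-only"},
}
actual = {
    "sg-web": {"ingress": "0.0.0.0/0", "port": 443},
    "role-app": {"policy": "read-only"},
    "role-hotfix": {"policy": "admin"},
}
drift = diff_state(declared, actual)
print(drift)
```

Here the widened ingress rule shows up under `changed` and the emergency role under `unmanaged`. Both are items a recovery team would otherwise have to rediscover during the outage.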
Why relationships matter more than snapshots
Snapshots preserve a moment in time, but they do not preserve the operating logic of a system. Cloud services depend on permission chains, network paths, resource relationships, and version history. If those relationships are not captured, the restored system may look correct while still failing in practice. That is why point-in-time backups are incomplete on their own. Practical implication: capture dependency and access relationships alongside snapshot data so recovery can rebuild working state, not just inventory.
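As an illustration of capturing relationships alongside snapshots, the sketch below records what each resource depends on, derives a rebuild order, and reports anything a snapshotted resource needs that the snapshot inventory does not contain. It uses `graphlib.TopologicalSorter` from the Python standard library; `rebuild_plan` and all resource names are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def rebuild_plan(snapshot_inventory, depends_on):
    """Return (order, uncaptured) for a set of snapshotted resources.

    depends_on maps a resource to the things it needs first (roles,
    networks, upstream services). Anything referenced there but absent
    from the snapshot inventory is reported as uncaptured.
    """
    referenced = set(depends_on)
    for deps in depends_on.values():
        referenced.update(deps)
    uncaptured = sorted(referenced - set(snapshot_inventory))

    # Map each resource to its prerequisites; a circular dependency
    # makes static_order() raise graphlib.CycleError.
    graph = {res: set(depends_on.get(res, ())) for res in snapshot_inventory}
    order = list(TopologicalSorter(graph).static_order())
    return order, uncaptured


# Hypothetical environment: only the app and database were snapshotted.
inventory = ["app", "db"]
depends_on = {"app": ["db", "role-app"], "db": ["vpc-main"]}

order, uncaptured = rebuild_plan(inventory, depends_on)
print(order)
print(uncaptured)
```

In this example the snapshot inventory looks complete, yet the rebuild order starts with a role and a network that were never captured. That is the gap between an inventory and a working state.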
Threat narrative
Attacker objective: The attacker or outage condition exploits the gap between backed-up data and recoverable infrastructure state, extending downtime and preventing a clean restoration of service.
- Entry begins when a cloud environment changes outside the expected IaC path, such as via ClickOps, manual remediation, or untracked AI-driven modifications.
- Credential access or abuse is not the only issue here. The decisive failure is that recovery depends on permissions, dependencies, and state that were never captured or tracked correctly.
- Impact arrives during an outage or ransomware event when teams discover they can restore data but cannot recreate the environment fast enough to meet RTO.
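The RTO point in the last step comes down to simple arithmetic: the clock covers the whole rebuild, not only the data restore. A minimal sketch, assuming hypothetical phase timings from a drill and a hypothetical three-hour RTO (`rto_margin` and the phase names are illustrative):

```python
from datetime import timedelta


def rto_margin(phase_durations, rto):
    """Return (total_recovery_time, margin); a negative margin misses the RTO."""
    total = sum(phase_durations.values(), timedelta())
    return total, rto - total


# Hypothetical drill timings and a hypothetical 3-hour RTO.
phases = {
    "restore_data": timedelta(minutes=40),
    "rebuild_iam_and_roles": timedelta(minutes=90),
    "rebuild_network": timedelta(minutes=45),
    "validate_application": timedelta(minutes=30),
}
rto = timedelta(hours=3)

total, margin = rto_margin(phases, rto)
print(total, margin < timedelta(0))
```

With these assumed numbers the data restore alone takes 40 minutes, well inside the target, while the full rebuild takes 3 hours 25 minutes and overruns the RTO by 25 minutes.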
Breaches seen in the wild
- Cisco DevHub NHI breach — IntelBroker exploited exposed Cisco credentials, API tokens and keys in DevHub.
- LiteLLM PyPI package breach — LiteLLM PyPI supply chain attack, credentials stolen from users.
Read our 52 NHI Breaches Analysis report for a comprehensive view of breaches impacting Non-Human Identities including AI Agents.
NHI Mgmt Group analysis
Backup without recoverable infrastructure state is not recovery. The article is right to separate data protection from system reconstruction, because most cloud outages fail on the second problem. That gap matters across IAM, NHI, and platform operations, where permissions and dependencies are part of the system itself. Practitioners should treat recoverability as a state-management problem, not a storage problem.
Untracked drift is an identity and recovery control failure, not just a configuration annoyance. When live cloud state diverges from declared infrastructure, access paths, dependencies, and service relationships can no longer be trusted during restoration. That means the organisation no longer knows which version of the environment is authoritative. The implication is that recovery planning must be built around observed state, not assumed state.
Infrastructure relationships are the real recovery asset. Backups preserve data, but the relationships among services, permissions, and network paths determine whether that data is usable. This is where identity governance and disaster recovery meet: if access cannot be recreated cleanly, the system cannot be recovered cleanly. Practitioners should think of permissions lineage and dependency mapping as part of resilience architecture.
Cloud recovery exposure grows whenever teams rely on point-in-time snapshots as proof of resilience. A snapshot captures a moment, not a functioning system. In practice, the organisation still has to reconstruct change history, runtime dependencies, and operating context. That is why snapshot confidence often collapses under real outage pressure. The implication is that recovery assurance must be measured by live rebuild outcomes, not by backup presence.
Identity blast radius becomes a recovery variable when AI-driven changes are not governed. The article’s reference to AI-generated infrastructure changes is a useful signal that untracked modification speed now matters as much as human error. When change volume accelerates, recovery teams inherit more unknowns and less trustworthy state. The implication is that identity, automation, and recovery controls now need to be designed together.
From our research:
- Systems with least-privileged AI access had a 17% incident rate vs 76% for over-privileged systems, according to The 2026 Infrastructure Identity Survey.
- Only 44% of organisations have implemented any policies to manage their AI agents, despite 92% agreeing that governing AI agents is critical to enterprise security, according to The 2026 Infrastructure Identity Survey.
- For teams extending recovery governance into infrastructure identity, the right next step is the Top 10 NHI Issues, which frames the control gaps that make hidden state harder to recover.
What this signals
Infrastructure recovery is now an identity-state problem as much as a data-protection problem. Once cloud environments change outside declared workflows, the trust boundary shifts from the backup repository to the live state of permissions, dependencies, and network relationships. Teams that still measure resilience only by stored copies will miss the point where recovery actually fails.
With 70% of organisations already granting AI systems more access than human employees, per The 2026 Infrastructure Identity Survey, the operational risk is no longer limited to human-driven drift. AI-generated changes increase the number of state transitions that recovery teams must explain, version, and trust. That makes auditable change control part of resilience planning, not just governance overhead.
The practical signal for security and platform teams is that recovery readiness now depends on observability of state, not confidence in backup tooling. If the team cannot answer what changed, what dependencies exist, and which permissions are current, then the restore path is still incomplete. That is the control gap to close before the next outage.
For practitioners
- Test full recovery, not just restore jobs: Run disaster drills that rebuild the service end to end, including IAM permissions, networking, dependencies, and runtime validation. A successful file restore is not evidence of recoverability until the rebuilt environment actually runs.
- Track live infrastructure state against declared IaC: Continuously compare Terraform or other declared definitions with the actual cloud environment, and flag drift as a recovery risk. Treat manual console changes and emergency fixes as state changes that must be captured.
- Capture permissions and dependency relationships: Document how services, roles, network paths, and upstream dependencies fit together so recovery can reconstruct working access paths, not only resource inventories. This is especially important where identity and service interdependence is high.
- Govern all infrastructure changes, including AI-generated ones: Ensure every change path is versioned, auditable, and visible to the recovery process, including modifications created by AI agents or automation. What is not tracked cannot be restored with confidence.
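The last recommendation can be checked mechanically if change events are recorded in a consistent shape. The sketch below assumes a hypothetical event format with `actor`, `actor_type`, and `version_ref` fields; none of these names come from a real audit log schema.

```python
def untracked_changes(changes):
    """Return changes that lack a version reference or a named actor."""
    return [
        c for c in changes
        if not c.get("version_ref") or not c.get("actor")
    ]


def count_by_actor_type(changes):
    """Tally changes per actor type so gaps can be traced to a change path."""
    counts = {}
    for c in changes:
        kind = c.get("actor_type", "unknown")
        counts[kind] = counts.get(kind, 0) + 1
    return counts


# Hypothetical change events for one environment.
changes = [
    {"resource": "sg-web", "actor": "alice", "actor_type": "human",
     "version_ref": "a1b2c3d"},
    {"resource": "role-hotfix", "actor": "bob", "actor_type": "human",
     "version_ref": None},  # console fix, never committed
    {"resource": "queue-orders", "actor": "infra-agent", "actor_type": "ai_agent",
     "version_ref": None},  # agent change with no versioned record
]

gaps = untracked_changes(changes)
print([c["resource"] for c in gaps])
print(count_by_actor_type(gaps))
```

Grouping the untracked changes by actor type shows whether the gap sits with manual fixes, automation, or AI agents, which indicates where change control needs tightening first.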
Key takeaways
- Cloud backup problems become disaster recovery failures when teams cannot rebuild the full environment, including permissions and dependencies.
- The evidence of resilience is not a healthy backup job, but a tested ability to return the service to working state within the RTO.
- Drift, untracked changes, and missing relationship data are the gaps that most directly determine whether restoration becomes guesswork.
Standards & Framework Alignment
This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.
The OWASP Non-Human Identity Top 10 addresses the attack and risk surface, while NIST CSF 2.0 and NIST Zero Trust (SP 800-207) set the governance and control requirements practitioners need to meet.
| Framework | Control / Reference | Relevance |
|---|---|---|
| NIST CSF 2.0 | RC.RP | Recovery planning is central to the article's full-system rebuild focus. |
| NIST Zero Trust (SP 800-207) | PR.AC-4 | Recovery depends on permissions and access paths being recreated correctly. |
| OWASP Non-Human Identity Top 10 | NHI-03 | Untracked machine and AI-driven changes expose non-human identity governance gaps. |
Validate that rebuilt environments re-establish least-privilege access before declaring recovery complete.
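That validation step can be sketched as a comparison between a known-good permission baseline and what the rebuilt environment actually grants. The identities and the `compare_permissions` helper are hypothetical, and the permission strings are illustrative. The comparison is plain string matching and does not expand wildcards.

```python
def compare_permissions(baseline, rebuilt):
    """Per identity, list permissions the rebuild added or lost."""
    report = {}
    for identity in sorted(set(baseline) | set(rebuilt)):
        want = set(baseline.get(identity, ()))
        have = set(rebuilt.get(identity, ()))
        excess, missing = sorted(have - want), sorted(want - have)
        if excess or missing:
            report[identity] = {"excess": excess, "missing": missing}
    return report


# Hypothetical baseline and post-recovery grants for two service identities.
baseline = {
    "svc-orders": ["s3:GetObject", "sqs:SendMessage"],
    "svc-report": ["s3:GetObject"],
}
rebuilt = {
    "svc-orders": ["s3:*", "sqs:SendMessage"],  # broadened during the rebuild
    "svc-report": ["s3:GetObject"],
}

report = compare_permissions(baseline, rebuilt)
print(report)
```

An empty report is the condition for declaring this part of recovery complete. In the example, `svc-orders` came back with a broader grant than it had before the outage, so the rebuild restored service but not least privilege.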
Key terms
- Recovery Time Objective: The maximum acceptable time to restore a service after disruption. In cloud environments, RTO is not satisfied by restoring files alone. The environment, identity paths, and dependencies must also return to a usable state within the target window.
- Infrastructure Drift: The divergence between declared infrastructure and the actual cloud state. Drift can come from manual fixes, emergency changes, or automation. It matters because recovery has to reproduce the environment as it actually ran, not the configuration as it was intended.
- Full Recovery Test: A drill that validates the entire rebuild path, not only data restoration. It checks whether permissions, networking, dependencies, and application behavior can all be re-established. This is the practical test of whether backup supports real resilience.
- Infrastructure Relationships: The permissions, network paths, service dependencies, and operational links that make cloud resources function as a system. Backups preserve data, but relationships determine whether that data can be used after restoration. Recoverability depends on capturing both.
Deepen your knowledge
Cloud recovery, infrastructure state, and identity relationships are core topics in our NHI Foundation Level course, the industry's only accredited NHI security programme. If your team is trying to govern recovery across cloud and identity boundaries, it is worth exploring.
This post draws on content published by ControlMonkey: cloud backup mistakes and recovery gaps. Read the original.
Published by the NHIMG editorial team on 2026-04-14.
NHI Mgmt Group — the independent authority on Non-Human Identity, IAM, and Agentic AI security. nhimg.org