TL;DR: Dynatrace dashboards, alerts, monitors, and metrics can now be backed up and restored through versioned snapshots as part of a cloud disaster recovery platform, reducing recovery time after misconfigurations, incidents, or ransomware, according to ControlMonkey. The real governance issue is not data loss alone but the loss of monitoring control plane state, which can leave teams blind when they most need observability.
At a glance
What this is: A ControlMonkey announcement about backing up and restoring Dynatrace configurations as part of cloud disaster recovery.
Why it matters: Observability tooling is part of the operational control plane, and losing its configuration can slow detection, response, and recovery across infrastructure and identity-dependent environments.
👉 Read ControlMonkey's Dynatrace configuration backup and recovery announcement
Context
Dynatrace configuration backup is about preserving the monitoring control plane, not just preserving telemetry. Dashboards, alerts, monitors, and metrics are operational assets, and when they are deleted or corrupted during an incident, teams lose visibility at the exact point where they need it most.
For IAM, NHI, and platform teams, this sits in the same governance bucket as protecting service accounts, API keys, and other non-human identities. Recovery is no longer only about servers and data. It also includes the configuration state that tells operators what is happening, where it is happening, and whether response controls are still functioning.
Key questions
Q: How should security teams recover observability platforms after a configuration loss?
A: Security teams should treat observability recovery as a configuration restoration problem, not a rebuild from scratch. The right approach is to maintain versioned snapshots of dashboards, alerts, monitors, and metrics, then test whether those artifacts can be restored into a known-good state quickly enough to support incident response. Recovery only works if it restores operational context, not just software availability.
Q: Why does observability configuration deserve the same protection as infrastructure?
A: Because configuration defines how the platform behaves during an incident. If dashboards, alerts, or monitors are altered or deleted, the team may still have telemetry but lose the logic needed to interpret it. That turns a recoverable event into a visibility problem, which delays detection and response. Protecting the configuration layer preserves the decision-making structure of the monitoring environment.
Q: What breaks when monitoring settings are not recoverable?
A: What breaks first is trust in the monitoring layer. Alerting may no longer reflect the current environment, dashboards may mislead responders, and investigation becomes slower because the team cannot prove what changed. In practice, that means the organisation may be operationally exposed even when core systems are still online, because response depends on the integrity of the observability state.
Q: Which teams should own observability disaster recovery?
A: Ownership should sit jointly with platform engineering, SRE, and security operations, because observability recovery affects both service continuity and incident response. The teams that manage the monitoring tools need to define snapshots, restore procedures, and access controls, while security leadership ensures those controls are tested and governed. If no one owns the recovery path, the monitoring plane becomes a single point of operational failure.
How it works in practice
Why observability configuration is a recoverable asset
Observability platforms do not just collect signals. They encode detection logic, routing rules, alert thresholds, and operational context in configuration. If those settings are changed, deleted, or drift out of alignment, the platform may still be online while the team has effectively lost its control layer. That is why backup and restore need to cover the configuration state itself, not only the data being observed. In practice, the important asset is the known-good monitoring posture, because that is what lets teams distinguish noise from an actual incident.
Practical implication: treat observability configuration as part of disaster recovery scope, not as a separate admin concern.
Versioned snapshots and recovery of monitoring state
Versioned snapshots create a rollback path for monitoring assets by preserving point-in-time configuration states. That matters because observability failures often come from gradual misconfiguration, accidental deletion, or unauthorized change rather than total platform outage. When backup captures dashboards, monitors, alerts, and metrics together, teams can restore a coherent monitoring model instead of rebuilding it manually from memory. This is especially useful when multiple teams manage different views of the same environment and need to recover consistent alerting fast after an outage or attack.
Practical implication: define snapshot frequency and restore ownership before an incident forces a manual rebuild.
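The rollback path described above can be illustrated with a minimal sketch. It assumes monitoring configuration has already been exported as plain nested dictionaries; `SnapshotStore`, `fingerprint`, and the sample dashboard and alert names are hypothetical and do not represent ControlMonkey's or Dynatrace's actual interfaces.

```python
import hashlib
import json
from dataclasses import dataclass, field


def fingerprint(config: dict) -> str:
    """Stable SHA-256 of a configuration, independent of key order."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass
class SnapshotStore:
    """Hypothetical in-memory store; a real one would use durable storage."""
    versions: list = field(default_factory=list)

    def capture(self, config: dict, label: str) -> int:
        """Append a point-in-time copy and return its version number."""
        # Round-trip through JSON so later edits to `config` cannot alter the snapshot.
        frozen = json.loads(json.dumps(config))
        self.versions.append(
            {"label": label, "digest": fingerprint(frozen), "config": frozen}
        )
        return len(self.versions) - 1

    def restore(self, version: int) -> dict:
        """Return a copy of the stored config after verifying its digest."""
        entry = self.versions[version]
        if fingerprint(entry["config"]) != entry["digest"]:
            raise ValueError(f"snapshot {version} failed integrity check")
        return json.loads(json.dumps(entry["config"]))


store = SnapshotStore()
monitoring = {
    "dashboards": {"checkout-latency": {"tiles": 4}},
    "alerts": {"high-error-rate": {"threshold": 0.05, "route": "oncall-sre"}},
}
v0 = store.capture(monitoring, "approved baseline")

# Simulate an accidental deletion, then take a second snapshot.
del monitoring["alerts"]["high-error-rate"]
v1 = store.capture(monitoring, "after change")

restored = store.restore(v0)
print(sorted(restored["alerts"]))
```

Capturing dashboards and alerts in one snapshot is what lets the restore return a coherent state: version `v0` still contains the deleted alert, while `v1` records the state after the change.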
Unified disaster recovery across cloud and SaaS tools
A unified disaster recovery method tries to close the gap between cloud infrastructure recovery and SaaS configuration recovery. That gap matters because a platform can be resilient at the compute layer while still being operationally fragile if the monitoring system, ticketing flow, or alert routing is lost. For identity and platform teams, the key issue is governance consistency across tools that support detection and response. If those tools are not managed with the same discipline as infrastructure, recovery becomes fragmented and can outlast the outage that triggered it.
Practical implication: include monitoring platforms in the same resilience review as core infrastructure and adjacent SaaS services.
NHI Mgmt Group analysis
Monitoring configuration is part of the operational identity plane, not an auxiliary admin setting. Dashboards, alerts, monitors, and metrics are the ruleset that tells teams what matters and when to respond. When that configuration disappears, the organisation still has tools but has lost the decision structure around those tools. Practitioners should therefore treat observability state as governed operational infrastructure, not disposable UI metadata.
Resilience for observability is a control-plane problem, not a storage problem. Backup is useful only if it preserves the exact configuration state needed to reconstruct detection and response behaviour. Versioned snapshots matter because they preserve the environment’s prior operating logic, which is what teams need after misconfiguration or ransomware. The practitioner lesson is to govern recoverability at the configuration layer, not only at the data layer.
ControlMonkey’s announcement reflects a broader shift from recovery of assets to recovery of operational intent. Mature resilience programmes increasingly need to restore not just systems, but the policies and thresholds that make those systems usable under stress. That aligns with NIST Cybersecurity Framework 2.0 recovery thinking, where restoration is only meaningful if the restored service supports detection and response. Teams should evaluate whether their DR scope includes the tools that interpret infrastructure, not just the infrastructure itself.
Configuration drift in observability is a hidden failure mode because it degrades trust before it causes outage. Teams often notice the problem only after an incident when alerts fail to fire or dashboards no longer reflect reality. That makes monitoring recovery a governance issue for platform, SRE, and security operations leaders together. The practical conclusion is to measure configuration integrity with the same seriousness as service availability.
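One way to make configuration integrity measurable is to compare the live configuration against an approved baseline on a schedule. The sketch below assumes both are available as nested dictionaries; `diff_config` and the sample alert names are illustrative assumptions, not part of any vendor tooling.

```python
def diff_config(baseline: dict, current: dict, path: str = "") -> list:
    """Return human-readable differences between two nested dicts."""
    changes = []
    for key in sorted(set(baseline) | set(current)):
        where = f"{path}/{key}"
        if key not in current:
            changes.append(f"removed {where}")
        elif key not in baseline:
            changes.append(f"added {where}")
        elif isinstance(baseline[key], dict) and isinstance(current[key], dict):
            # Recurse so the report points at the exact setting that drifted.
            changes.extend(diff_config(baseline[key], current[key], where))
        elif baseline[key] != current[key]:
            changes.append(
                f"changed {where}: {baseline[key]!r} -> {current[key]!r}"
            )
    return changes


baseline = {
    "alerts": {
        "high-error-rate": {"threshold": 0.05, "route": "oncall-sre"},
        "disk-full": {"threshold": 0.9},
    }
}
current = {
    "alerts": {"high-error-rate": {"threshold": 0.5, "route": "oncall-sre"}},
    "dashboards": {"tmp": {}},
}

drift = diff_config(baseline, current)
for line in drift:
    print(line)
```

An empty result means the monitoring posture still matches the baseline. Any entry is a candidate for review before an incident exposes it, for example a threshold quietly raised from 0.05 to 0.5.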
Observability backup is becoming a prerequisite for cross-tool resilience governance. As cloud and SaaS operations converge, recovery models must span more than one control domain. A post-incident environment that restores workloads but not alerting logic is still operationally incomplete. Practitioners should evaluate resilience across the full monitoring stack, including the SaaS systems that expose and explain cloud risk.
From our research:
- A 30-min meeting will save your team 1000s of hours, according to The 2026 Infrastructure Identity Survey.
- Only 44% of organisations have implemented any policies to manage their AI agents, despite 92% agreeing that governing AI agents is critical to enterprise security.
- That governance gap is why teams should also review the NHI Lifecycle Management Guide for lifecycle controls that keep operational identity state recoverable.
What this signals
Configuration recovery is becoming part of identity-adjacent resilience planning. As observability and automation platforms spread across cloud operations, teams need to decide which control-plane settings are recoverable and who owns that recovery. The practical shift is toward treating operational context as a governed asset, not a convenience layer.
With 67% of organisations still relying heavily on static credentials despite the risks they pose to agentic AI deployments, per The 2026 Infrastructure Identity Survey, resilience programmes that ignore access pathways and control-plane state will stay incomplete.
Resilience scope will keep expanding beyond infrastructure. Organisations that can restore workloads but not their monitoring logic will still struggle during an outage. The next maturity step is to align platform, security, and IAM teams around restoring the systems that define what the organisation can see and verify.
For practitioners
- Include observability configuration in disaster recovery scope: Map dashboards, monitors, alerts, and metrics to the same recovery objectives used for cloud workloads and management tools. If a team cannot restore those configurations quickly, the DR plan is incomplete.
- Define a known-good monitoring baseline: Capture approved configurations for critical observability assets so teams can restore an intact alerting posture after accidental deletion or malicious change. Baselines should be owned, reviewed, and versioned.
- Test restore workflows for monitoring platforms: Run recovery drills that rebuild observability state from snapshots, not just infrastructure from templates. Measure whether responders can recover alert routing and dashboard context without manual redesign.
- Review access to observability administration APIs: Limit who can change alerting logic, thresholds, and dashboard structures, and log those changes as operationally sensitive events. Configuration tampering should be detectable before it becomes a visibility failure.
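A restore drill of the kind recommended above can be reduced to two checks: did the restore reproduce the baseline, and did it finish within the recovery time objective. The following is a minimal sketch under those assumptions; `run_restore_drill`, `fake_restore`, and the 900-second objective are hypothetical placeholders for a team's own restore procedure and targets.

```python
import time
from typing import Callable


def run_restore_drill(
    restore: Callable[[], dict], baseline: dict, rto_seconds: float
) -> dict:
    """Time a restore callable and compare the result with the baseline and RTO."""
    started = time.monotonic()
    restored = restore()
    elapsed = time.monotonic() - started
    return {
        "matches_baseline": restored == baseline,
        "within_rto": elapsed <= rto_seconds,
        "elapsed_seconds": elapsed,
    }


approved_baseline = {"alerts": {"high-error-rate": {"route": "oncall-sre"}}}


def fake_restore() -> dict:
    # Stand-in for a real restore procedure; returns a copy of the baseline.
    return {"alerts": {"high-error-rate": {"route": "oncall-sre"}}}


def partial_restore() -> dict:
    # Stand-in for a restore that silently loses alert routing.
    return {"alerts": {"high-error-rate": {}}}


report = run_restore_drill(fake_restore, approved_baseline, rto_seconds=900)
failed = run_restore_drill(partial_restore, approved_baseline, rto_seconds=900)
print(report["matches_baseline"], failed["matches_baseline"])
```

Recording both outcomes per drill gives a simple trend to review: a restore that completes quickly but does not match the baseline is still a failed drill.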
Key takeaways
- Observability backup is a governance issue because configuration loss can disable the alerting logic responders depend on.
- The practical risk is operational blindness, not just inconvenience, because deleted or altered monitoring state slows incident detection and recovery.
- Teams should test whether they can restore dashboards, alerts, monitors, and metrics into a known-good state before they need that capability in an outage.
Standards & Framework Alignment
This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.
The OWASP Non-Human Identity Top 10 addresses the attack and risk surface, while NIST CSF 2.0 and NIST Zero Trust (SP 800-207) set the governance and control requirements practitioners need to meet.
| Framework | Control / Reference | Relevance |
|---|---|---|
| NIST CSF 2.0 | RC.RP-01 | Recovery planning applies directly to restoring observability configuration after incidents. |
| NIST Zero Trust (SP 800-207) | Zero trust tenets (Section 2.1); compare NIST CSF 2.0 PR.AA-05 (PR.AC-4 in CSF 1.1) | Access control over admin APIs is central to protecting monitoring configuration. |
| OWASP Non-Human Identity Top 10 | NHI-03 | Configuration access and governance for machine-administered tools map to NHI control discipline. |
Review which non-human identities can alter observability tools and limit them to least privilege.
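As a rough illustration of that review, the sketch below flags non-human identities that hold write access to monitoring configuration without being on an approved list. It assumes an inventory of identities and their scopes has already been exported; the scope names, identity names, and `find_overprivileged` are invented for the example and are not real Dynatrace token scopes.

```python
# Hypothetical scope names that allow changes to monitoring configuration.
WRITE_SCOPES = {"config.write", "alerts.write", "dashboards.write"}


def find_overprivileged(identities: list, approved_writers: set) -> list:
    """Return (name, write scopes) for identities not approved to change config."""
    flagged = []
    for identity in identities:
        write_access = WRITE_SCOPES & set(identity["scopes"])
        if write_access and identity["name"] not in approved_writers:
            flagged.append((identity["name"], sorted(write_access)))
    return flagged


inventory = [
    {"name": "ci-deployer", "scopes": ["config.write", "metrics.read"]},
    {"name": "report-reader", "scopes": ["metrics.read"]},
    {"name": "legacy-script", "scopes": ["alerts.write", "dashboards.write"]},
]

flagged = find_overprivileged(inventory, approved_writers={"ci-deployer"})
for name, scopes in flagged:
    print(name, scopes)
```

Here only `legacy-script` is flagged: it can change alerts and dashboards but is not an approved writer, so its scopes are candidates for removal or explicit approval.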
Key terms
- Observability configuration: The saved settings that determine how a monitoring platform behaves, including dashboards, alerts, monitors, thresholds, and routing rules. It is the operational logic of the observability layer, and if it is lost or altered, the platform may still exist while response quality collapses.
- Known-good snapshot: A point-in-time copy of system state that has been validated as safe to restore. In monitoring and resilience programmes, it provides a trusted recovery baseline so teams can revert accidental changes, recover from compromise, and re-establish the intended operational model.
- Control plane resilience: The ability to preserve and restore the management logic that governs how systems are observed, controlled, and operated. It goes beyond uptime and data durability by ensuring the team can still direct, interpret, and trust the environment after disruption.
Deepen your knowledge
NHI governance, agentic AI identity, and machine identity lifecycle are core topics in our NHI Foundation Level course, the industry's only accredited NHI security programme. If you are responsible for identity security strategy or governance maturity, it is worth exploring.
This post draws on content published by ControlMonkey: Dynatrace configuration backup and recovery for cloud disaster recovery. Read the original.
Published by the NHIMG editorial team on 2026-03-25.
NHI Mgmt Group — the independent authority on Non-Human Identity, IAM, and Agentic AI security. nhimg.org