AI SRE agents and incident repair: are your controls keeping up?


(@nhi-mgmt-group)

TL;DR: According to WorkOS, Ciroos says its multi-agent AI SRE system can identify root cause, collect evidence, and generate remediation steps before humans join the incident, and enterprises are already deploying it in production. That shifts the governance problem from observability volume to approval boundaries, evidence quality, and delegated action control.

NHIMG editorial — based on content published by WorkOS: Ciroos is building AI SREs that can actually fix things

Questions worth separating out

Q: How should teams govern AI SRE agents that investigate incidents?

A: Start by separating investigative access from remediation authority.

Q: Why do AI incident response agents create new IAM risk?

A: They turn observability into a privileged workflow.

Q: What breaks when an AI SRE agent can both diagnose and act?

A: The boundary between detection and remediation collapses.

Practitioner guidance

  • Separate investigative and remediation identities. Give AI SRE agents distinct credentials for evidence gathering and for any action that can modify production state.
  • Require evidence-backed approval for every proposed fix. Treat agent output as a recommendation until a human verifies the evidence chain, root-cause logic, and rollback path.
  • Limit agent access to the smallest useful operational scope. Scope each agent to the domain it actually investigates, such as Kubernetes, cloud, or application logs, and deny lateral access to unrelated systems unless the investigation explicitly requires it.
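
The three points above can be sketched as a single authorisation check. This is a minimal illustration in Python, not anything Ciroos or WorkOS describes: the names AgentIdentity, ProposedFix, and authorize are hypothetical, and a real deployment would enforce the same rules in its IAM or policy engine rather than in application code.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class AgentIdentity:
    """Hypothetical credential record for one agent identity."""
    name: str
    domains: frozenset   # operational scopes, e.g. frozenset({"kubernetes"})
    can_modify: bool     # True only for the remediation identity


@dataclass
class ProposedFix:
    """Hypothetical remediation proposal produced by an agent."""
    domain: str
    action: str
    evidence: List[str]              # evidence chain a human can verify
    rollback: str                    # how to undo the change
    approved_by: Optional[str] = None  # set by a human reviewer


def authorize(identity: AgentIdentity, domain: str, modifies: bool,
              fix: Optional[ProposedFix] = None) -> Tuple[bool, str]:
    """Return (allowed, reason) for a requested operation."""
    # Smallest useful scope: deny anything outside the agent's domains.
    if domain not in identity.domains:
        return False, "domain out of scope"
    # Read-only investigation needs no further checks.
    if not modifies:
        return True, "read-only access within scope"
    # Separate identities: investigative credentials never modify state.
    if not identity.can_modify:
        return False, "investigative identity cannot modify production"
    # Evidence-backed approval: a fix stays a recommendation until reviewed.
    if fix is None or not fix.evidence or not fix.rollback:
        return False, "fix lacks evidence or rollback path"
    if fix.approved_by is None:
        return False, "awaiting human approval"
    return True, "approved remediation"
```

The checks run in order of increasing privilege, so an out-of-scope request is rejected before the code ever looks at whether a fix was approved.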

What's in the full article

WorkOS's full interview covers the operational detail this post intentionally leaves for the source:

  • Ronak Desai's description of how the multi-agent system divides work across network, security, cloud, application, and Kubernetes contexts
  • The specific path from read-only access to autopilot mode that the interview discusses for enterprise deployment
  • Why large enterprises and small teams are both asking for AI-assisted reliability work
  • The role of enterprise authentication and authorisation plumbing in making production deployment practical

👉 Read WorkOS's interview on Ciroos building AI SRE agents that fix incidents →

(@mr-nhi)
 

AI SRE systems expose a governance gap between observation and action. The article shows a system that can inspect telemetry, identify a likely root cause, and draft remediation before a human joins the incident. That means the old assumption that investigation is passive and intervention is deliberate no longer holds. Practitioners should treat agentic incident tooling as a privileged executor with bounded operational power, not as an enhanced dashboard.

A few things that frame the scale:

  • Organisations maintain an average of 6 distinct secrets manager instances, creating fragmentation that undermines centralised control, according to The State of Secrets in AppSec.
  • Only 44% of developers are reported to follow security best practices for secrets management, exposing a significant developer behaviour gap.

A question worth separating out:

Q: Should organisations let AI agents move from read-only to autopilot?

A: Only after they can prove that the agent’s actions are bounded, reversible, and fully auditable. The main decision is not whether the model is accurate enough. It is whether the organisation can constrain what the agent may change, verify why it changed it, and recover safely if the change was wrong.
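
One way to picture "bounded, reversible, and fully auditable" is a gate that every autopilot change must pass. The Python below is a sketch under stated assumptions: the allow-list contents, the Change fields, and the gate function are all hypothetical, and the in-memory list stands in for whatever tamper-evident audit store an organisation actually uses.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional

# Hypothetical allow-list: the only actions the agent may take unattended.
ALLOWED_ACTIONS = {"restart_pod", "scale_deployment"}


@dataclass
class Change:
    """Hypothetical description of one change the agent wants to make."""
    action: str
    target: str
    reason: str                      # why the agent wants the change
    rollback_action: Optional[str]   # how to undo it, if known


def gate(change: Change, audit_log: List[str]) -> bool:
    """Allow a change only if it is bounded, reversible, and explained.

    Every decision, allowed or denied, is appended to the audit log.
    """
    bounded = change.action in ALLOWED_ACTIONS
    reversible = bool(change.rollback_action)
    explained = bool(change.reason.strip())
    allowed = bounded and reversible and explained

    record = dict(asdict(change), bounded=bounded, reversible=reversible,
                  explained=explained, allowed=allowed)
    audit_log.append(json.dumps(record, sort_keys=True))
    return allowed
```

Denied requests are logged as well as allowed ones, because reviewing what the agent tried to do is part of verifying why it acted.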

👉 Read our full editorial: AI SRE agents are changing how enterprises handle incident repair



   