NHI Forum


Redefining Site Reliability Engineering for the AI Era


(@gitguardian)
Trusted Member
Joined: 8 months ago
Posts: 33
Topic starter  

Read full article here: https://blog.gitguardian.com/sreday-sf-2025/?utm_source=nhimg

 

At SREday San Francisco 2025, reliability engineers, DevOps leaders, and IT professionals gathered to discuss one defining theme: automation can enhance reliability, but only human judgment sustains it. Against the backdrop of San Francisco’s century-old cable cars, a living symbol of mechanical precision guided by human skill, the event highlighted how Site Reliability Engineering (SRE) must evolve to balance automation, observability, and trust in an era of AI-driven operations and non-human identities (NHIs).

Across two tracks and more than twenty sessions, experts explored what it means to practice human-centered SRE when intelligent systems and agentic automation are now deeply embedded in production environments. The consensus was clear: dashboards do not defend anything on their own. Meaningful resilience depends on people, context, and decision-making that technology cannot replace.

 

Key Highlights from SREday SF 2025

  1. Automation that Listens to People

Jimmy Katiyar, Senior Product Manager at SiriusXM, emphasized that automation should amplify engineers rather than replace them. His talk, “The Human Factor in Site Reliability: Designing Automation That Amplifies Engineering,” focused on keeping humans in the decision loop, especially during ambiguous situations.

Katiyar demonstrated how pairing runbook automation with human decision points reduced recovery times and improved customer trust. The message: allow humans to pause or override automation when confidence is low, and build systems that value human intuition alongside algorithmic speed.
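A minimal sketch of that idea, assuming a confidence score the automation assigns to itself; the names (`RunbookStep`, `CONFIDENCE_THRESHOLD`) are illustrative, not taken from the talk:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RunbookStep:
    """One automated remediation step with a self-assessed confidence score."""
    name: str
    action: Callable[[], None]
    confidence: float  # 0.0 (no idea) .. 1.0 (certain)

CONFIDENCE_THRESHOLD = 0.8  # hypothetical cutoff: below this, a human decides

def execute(step: RunbookStep, ask_human: Callable[[str], bool]) -> str:
    """Auto-run high-confidence steps; pause and defer to an operator otherwise."""
    if step.confidence >= CONFIDENCE_THRESHOLD:
        step.action()
        return "auto-executed"
    prompt = f"Low confidence ({step.confidence:.2f}) for '{step.name}'. Proceed?"
    if ask_human(prompt):
        step.action()
        return "human-approved"
    return "human-vetoed"

# A routine restart runs unattended; an ambiguous failover waits for a person.
restart = RunbookStep("restart-pod", lambda: None, 0.95)
failover = RunbookStep("region-failover", lambda: None, 0.40)
print(execute(restart, ask_human=lambda q: False))   # auto-executed
print(execute(failover, ask_human=lambda q: False))  # human-vetoed
```

The key design choice is that the override path is first-class, not an exception handler: low confidence is an expected state, and the human decision is recorded just like an automated one.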

 

  2. Observability that Drives Decisions

In “From Dashboard to Defense: Automating Resilience at Large Scale,” Sureshkumar Karuppuchamy, Engineering Lead at eBay, explained why dashboards without context are meaningless. He encouraged teams to focus on Service Level Indicators (SLIs) like latency and checkout success instead of vanity metrics.

Karuppuchamy introduced the “staged autonomy pattern” — starting with shadow mode for observation, moving to suggest mode for supervised learning, and finally to autonomous mode with transparency and reversibility. This structured progression mirrors identity and access control strategies for non-human identities (NHIs), ensuring automation remains accountable and explainable.
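The staged autonomy pattern can be sketched as a small dispatcher; the stage names follow the talk, but the function signatures are assumptions for illustration:

```python
from enum import Enum, auto
from typing import Callable

class AutonomyStage(Enum):
    SHADOW = auto()      # observe only: log what automation *would* do
    SUGGEST = auto()     # propose actions; a human approves or rejects
    AUTONOMOUS = auto()  # act directly, but transparently and reversibly

def dispatch(stage: AutonomyStage, action: str,
             apply: Callable[[str], None],
             log: Callable[[str], None],
             approve: Callable[[str], bool]) -> bool:
    """Route a proposed action by autonomy stage; return True if applied."""
    if stage is AutonomyStage.SHADOW:
        log(f"shadow: would run {action}")
        return False
    if stage is AutonomyStage.SUGGEST:
        if approve(action):
            apply(action)
            log(f"suggested, human-approved: {action}")
            return True
        log(f"suggested, human-rejected: {action}")
        return False
    apply(action)
    log(f"autonomous: {action} (audited, reversible)")
    return True

applied, audit = [], []
dispatch(AutonomyStage.SHADOW, "scale-up", applied.append, audit.append,
         approve=lambda a: True)
dispatch(AutonomyStage.AUTONOMOUS, "scale-up", applied.append, audit.append,
         approve=lambda a: True)
print(applied)  # ['scale-up'] -- shadow mode never touched production
```

Note that every branch logs: transparency is what makes the final autonomous stage acceptable, since each action leaves the same audit trail a human operator would.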

 

  3. Chaos Experiments That Generate Real Insights

AWS leaders Saurabh Kumar and Ruskin Dantra revealed how Generative AI can transform chaos engineering from experimentation to continuous learning. Their session showed that the challenge isn’t simulating failures but forming precise hypotheses and verification loops tied to business outcomes.

They urged teams to define “steady state” in measurable terms such as API performance, latency, and user experience metrics. When linked to financial or user-impact measures, chaos becomes a trust-building tool, not a stunt. The process bridges reliability testing and security observability, exposing weak trust boundaries before attackers can.
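A hypothesis-and-verify loop of this kind can be sketched as follows; the SLO fields and thresholds are hypothetical stand-ins for the business-linked metrics the speakers described:

```python
from typing import Callable

def steady_state_holds(observed: dict, slo: dict) -> bool:
    """A measurable 'steady state': p99 latency at or under target,
    checkout success rate at or over target."""
    return (observed["p99_latency_ms"] <= slo["p99_latency_ms"]
            and observed["checkout_success"] >= slo["checkout_success"])

def run_experiment(inject_fault: Callable[[], None],
                   measure: Callable[[], dict], slo: dict) -> str:
    """Hypothesis: steady state survives the injected fault.
    Verify before injection (don't experiment on a sick system) and after."""
    if not steady_state_holds(measure(), slo):
        return "abort: system unhealthy before injection"
    inject_fault()
    if steady_state_holds(measure(), slo):
        return "hypothesis held"
    return "hypothesis refuted: investigate blast radius"

SLO = {"p99_latency_ms": 250, "checkout_success": 0.995}
metrics = {"p99_latency_ms": 180, "checkout_success": 0.999}
result = run_experiment(inject_fault=lambda: None,
                        measure=lambda: metrics, slo=SLO)
print(result)  # hypothesis held
```

Because the pass/fail criterion is expressed in the same units as the SLO, a refuted hypothesis translates directly into a user-impact statement rather than an abstract fault report.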

 

  4. Data That Tells the Truth About Usage

Avi Press, Founder and CEO of Scarf, challenged the industry’s obsession with download counts. In his session “10 Billion Downloads: Insights and Trends in Open Source,” he explained that downloads are not users. Automated systems inflate metrics, hiding real usage and security risk trends.

He called for better telemetry in supply chain security, distinguishing human vs. automated consumption to inform vulnerability management and dependency policies. His findings also exposed a critical risk: most traffic still pulls “latest” versions of packages, while pinned versions with known vulnerabilities often persist indefinitely — creating a silent but growing attack surface.
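As a toy illustration of the distinction (real registry telemetry such as Scarf's is far richer; the marker list and field names here are assumptions), a classifier might separate automated pulls from human ones and flag floating version requests:

```python
# Substrings that crudely suggest an automated consumer; illustrative only.
AUTOMATION_MARKERS = ("bot", "ci/", "docker", "jenkins", "github-actions")

def classify_download(user_agent: str, requested_version: str) -> tuple:
    """Tag a download as (human|automated, pinned|floating)."""
    ua = user_agent.lower()
    consumer = "automated" if any(m in ua for m in AUTOMATION_MARKERS) else "human"
    pinning = "floating" if requested_version in ("latest", "*") else "pinned"
    return (consumer, pinning)

print(classify_download("github-actions/runner", "latest"))  # ('automated', 'floating')
print(classify_download("Mozilla/5.0", "1.4.2"))             # ('human', 'pinned')
```

Even this crude split changes the picture: a spike of ('automated', 'floating') pulls is supply-chain exposure, not adoption, while persistent ('human', 'pinned') pulls of an old version may mark the vulnerable long tail the session warned about.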

 

Core Takeaways: The Case for Human-Centered Automation

  • Ambiguity Defeats Static Policy: Metrics and logs compress context. Without human interpretation, AI may optimize for the wrong objectives and miss adversarial signals.
  • Incentives Warp Signals: Teams often collect cheap or incomplete telemetry for speed, leading to brittle systems that fail under real-world complexity.
  • Verification Lags Change: Experiments and audits often occur post-incident. The new goal is continuous verification aligned to business outcomes.
  • Trust Is the Real SLO: Reliability and security both depend on trust — between engineers, users, and automated systems. Transparency and reversibility are the foundations of that trust.

 

The Future of SRE: Shared Control Between Humans and Machines

The event concluded with a clear principle: resilience grows from collaboration, not full autonomy. Just like San Francisco’s cable cars, reliability in modern infrastructure depends on skilled operators who understand when to intervene.

To build truly trustworthy systems, SREs must integrate observable, auditable, and reversible automation, especially as AI and NHIs expand across cloud-native architectures. Whether tuning alerting systems, refining chaos experiments, or managing secrets at scale, the craft of SRE is ultimately about shared control between humans and intelligent systems.

 



   