Identity service outages: what IAM teams should do differently


(@nhi-mgmt-group)

TL;DR: A regional AWS failure and cascading third-party outages disrupted Sign-On, AuthKit, and related services. According to WorkOS, failure rates for some requests reached 100%, and roughly 70% of requests failed at peak during the second incident window. The lesson is that identity availability now depends on multi-layer resiliency, not just authentication correctness.

NHIMG editorial — based on content published by WorkOS: service disruption on October 20, 2025 and the response plan

Questions worth separating out

Q: What breaks when identity services depend on a single cloud region?

A: The login path can fail even if the rest of the application is intact, because a failure in that one region takes down sign-in for every application that relies on it.

Q: Why do third-party dependencies make auth outages worse?

A: Third-party dependencies make auth outages worse because they expand the failure domain beyond the identity platform itself.
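
One way to make the failure domain concrete is to tag each upstream with the provider or region it lives in, then ask which identity flows break when that domain goes down. The sketch below is illustrative only: the flow names, upstream names, and domain labels are hypothetical and are not taken from the WorkOS post.

```python
from collections import defaultdict

# Hypothetical inventory: identity flow -> list of (upstream, failure_domain).
# Every name here is an assumption made for illustration.
DEPENDENCIES = {
    "sign_in": [("auth_db", "cloud:region-1"), ("feature_flags", "vendor-a")],
    "session_refresh": [("auth_db", "cloud:region-1"), ("token_cache", "cloud:region-1")],
    "hosted_pages": [("cdn", "vendor-b"), ("feature_flags", "vendor-a")],
    "admin": [("auth_db", "cloud:region-1"), ("secrets_store", "cloud:region-1")],
}


def flows_by_failure_domain(deps):
    """Return {failure_domain: sorted list of flows that depend on it}."""
    grouped = defaultdict(set)
    for flow, upstreams in deps.items():
        for _upstream, domain in upstreams:
            grouped[domain].add(flow)
    return {domain: sorted(flows) for domain, flows in grouped.items()}


def blast_radius(deps, failed_domain):
    """List the flows that lose at least one upstream if failed_domain is down."""
    return flows_by_failure_domain(deps).get(failed_domain, [])
```

In this made-up inventory, a single third-party feature-flag vendor sits behind both sign-in and hosted pages, so its outage reaches further than the identity platform's own region choice would suggest.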

Q: How do teams know if identity failover is actually working?

A: Teams know failover is working when sign-in, hosted pages, and admin functions recover cleanly during controlled dependency loss, not just when diagrams show a secondary region exists.
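
A failover drill can be scored mechanically rather than by inspection. The following is a minimal sketch, assuming each flow has a probe function that returns True when the flow works end to end; the probe names and the required-flow list are hypothetical placeholders, not part of any WorkOS tooling.

```python
from typing import Callable, Dict, Iterable


def run_probes(probes: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Run each probe once; a probe that raises counts as a failed flow."""
    results = {}
    for name, probe in probes.items():
        try:
            results[name] = bool(probe())
        except Exception:
            results[name] = False
    return results


def drill_passed(
    results: Dict[str, bool],
    required: Iterable[str] = ("sign_in", "hosted_pages", "admin"),
) -> bool:
    """The drill passes only if every required flow recovered.

    A flow with no probe result is treated as failed, so a missing
    check cannot silently count as success.
    """
    return all(results.get(flow, False) for flow in required)
```

Treating a missing probe as a failure is a deliberate choice here: the drill should not pass just because nobody wrote the admin check.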

Practitioner guidance

  • Map every identity-critical dependency chain: Document the services behind sign-in, session refresh, hosted pages, credential retrieval, and admin functions so you can see which upstreams share the same failure domain.
  • Test fallback behavior for identity SDKs: Review timeout settings, circuit breakers, and fail-closed versus fail-open defaults for any SDK used in authentication or hosted identity flows.
  • Validate multi-region recovery under degraded monitoring: Run failover exercises where observability is partially missing and deployment pipelines are slowed or broken, because that is when recovery logic is most likely to fail.
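
To illustrate the circuit-breaker and fail-closed points in the guidance above, here is a minimal sketch of a breaker that stops calling a failing upstream and returns an explicit fallback instead. The class, its parameters, and the thresholds are assumptions for illustration; they do not describe any particular identity SDK.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: open after repeated failures, retry after a cooldown."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # seconds to wait before a trial call
        self.clock = clock              # injectable so the breaker can be tested
        self.failures = 0
        self.opened_at = None

    def is_open(self):
        if self.opened_at is None:
            return False
        # Once the cooldown has elapsed, allow one trial call (half-open).
        return self.clock() - self.opened_at < self.reset_after

    def call(self, fn, fallback):
        """Call fn unless the breaker is open; use fallback on failure or when open."""
        if self.is_open():
            return fallback()
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        # Success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

For an authorization check, a fail-closed fallback would be something like `lambda: False` (deny when the upstream is unreachable), while a fail-open fallback would return a cached or permissive answer. Which default is right depends on the flow, which is why the guidance asks teams to review it explicitly.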

What's in the full analysis

WorkOS's full post covers the operational detail this post intentionally leaves for the source:

  • The minute-by-minute incident timeline, including escalation points and restoration checkpoints for each outage window.
  • The specific backend failure modes behind database credential retrieval, pod recycling, and feature-flag request hangs.
  • The immediate and Q1 2026 remediation plan, including hot standby, multi-region deployment, and graceful degradation changes.
  • The service-by-service impact breakdown showing which WorkOS products were affected and which remained stable.

👉 Read WorkOS's incident analysis of the October 20 service disruption →
