Identity service outages: what IAM teams should do differently

NHI Mgmt Group

(@nhi-mgmt-group)

Member Moderator

Joined: 1 year ago

Posts: 12212

Topic starter 07/06/2026 9:06 pm

TL;DR: A regional AWS failure and cascading third-party outages disrupted Single Sign-On, AuthKit, and related services. According to WorkOS, some request failure rates reached 100%, and roughly 70% of requests were failing at peak during the second incident window. The lesson is that identity availability now depends on multi-layer resiliency, not just authentication correctness.

NHIMG editorial — based on content published by WorkOS: service disruption on October 20, 2025 and the response plan

By the numbers:

AuthKit was heavily impacted with a failure rate of 100%.
Single Sign-On was heavily impacted with a failure rate of 50%.

Questions worth separating out

Q: What breaks when identity services depend on a single cloud region?

A: When identity services depend on a single cloud region, the login path can fail even if the rest of the application is intact.
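One way to make the single-region question concrete is to write the dependency map down as data and ask which identity flows lose at least one upstream when a given failure domain goes away. The sketch below is illustrative only: the flow names, upstream names, and failure domains are hypothetical placeholders, not WorkOS's actual architecture.

```python
# Hypothetical map: identity flow -> upstream services it needs to succeed.
FLOW_DEPENDENCIES = {
    "sign_in": ["auth_api", "primary_db", "feature_flags"],
    "session_refresh": ["auth_api", "primary_db"],
    "hosted_pages": ["cdn", "auth_api"],
    "admin_console": ["admin_api", "primary_db"],
}

# Hypothetical failure domain (region or vendor) for each upstream.
FAILURE_DOMAIN = {
    "auth_api": "cloud-region-a",
    "primary_db": "cloud-region-a",
    "admin_api": "cloud-region-a",
    "feature_flags": "vendor-flags",
    "cdn": "vendor-cdn",
}


def flows_broken_by(domain: str) -> list[str]:
    """Return, sorted, every flow with at least one upstream in `domain`."""
    return sorted(
        flow
        for flow, deps in FLOW_DEPENDENCIES.items()
        if any(FAILURE_DOMAIN[dep] == domain for dep in deps)
    )


def blast_radius() -> dict[str, list[str]]:
    """Map each failure domain to the flows that break if it is lost."""
    return {d: flows_broken_by(d) for d in sorted(set(FAILURE_DOMAIN.values()))}


if __name__ == "__main__":
    for domain, flows in blast_radius().items():
        print(f"{domain}: {', '.join(flows)}")
```

In this made-up map, losing the single cloud region breaks every flow, while losing either vendor breaks only one. That asymmetry is the thing a real map is meant to surface.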

Q: Why do third-party dependencies make auth outages worse?

A: Third-party dependencies make auth outages worse because they expand the failure domain beyond the identity platform itself.
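A rough way to see the effect is to multiply availabilities along the login path. The sketch below assumes every dependency is required for a login to succeed and that failures are independent. Both are simplifications, and the 99.9% figure is an illustrative assumption, not a number from the WorkOS report.

```python
import math
from typing import Iterable


def serial_availability(availabilities: Iterable[float]) -> float:
    """Availability of a path that needs every listed dependency to be up.

    Assumes independent failures, so the individual values multiply.
    """
    return math.prod(availabilities)


def expected_downtime_minutes(availability: float, days: int = 30) -> float:
    """Expected minutes of downtime over `days` at the given availability."""
    return (1.0 - availability) * days * 24 * 60


if __name__ == "__main__":
    # Illustrative: four required dependencies, each assumed to be 99.9% available.
    path = serial_availability([0.999] * 4)
    print(f"path availability: {path:.4%}")
    print(f"one dependency:  {expected_downtime_minutes(0.999):.1f} min / 30 days")
    print(f"whole path:      {expected_downtime_minutes(path):.1f} min / 30 days")
```

Under these assumptions, four 99.9% dependencies in series give a login path of about 99.6%, roughly four times the expected downtime of any single one. Correlated failures, such as several vendors sharing one cloud region, make the real figure worse than this estimate.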

Q: How do teams know if identity failover is actually working?

A: Teams know failover is working when sign-in, hosted pages, and admin functions recover cleanly during controlled dependency loss, not just when diagrams show a secondary region exists.

Practitioner guidance

Map every identity-critical dependency chain. Document the services behind sign-in, session refresh, hosted pages, credential retrieval, and admin functions so you can see which upstreams share the same failure domain.
Test fallback behavior for identity SDKs. Review timeout settings, circuit breakers, and fail-closed versus fail-open defaults for any SDK used in authentication or hosted identity flows.
Validate multi-region recovery under degraded monitoring. Run failover exercises where observability is partially missing and deployment pipelines are slowed or broken, because that is when recovery logic is most likely to fail.
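The second point above (circuit breakers and explicit fail-closed defaults) can be sketched as follows. This is a minimal illustration, not any vendor SDK's API: `CircuitBreaker`, `CircuitOpenError`, `is_feature_enabled`, and the `fetch_flag` callable are hypothetical names. It also assumes `fetch_flag` enforces its own request timeout and raises on failure, since the breaker only counts failures and does not interrupt a hung call.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the upstream call is skipped."""


class CircuitBreaker:
    """Opens after consecutive failures; allows one trial call after a cool-down."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                raise CircuitOpenError("upstream skipped: circuit open")
            # Cool-down elapsed: permit a single trial call (half-open).
            # One more failure re-opens the circuit immediately.
            self._opened_at = None
            self._failures = self.failure_threshold - 1
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0
        return result


def is_feature_enabled(
    breaker: CircuitBreaker,
    fetch_flag: Callable[[str], bool],
    flag: str,
    default: bool = False,
) -> bool:
    """Look up a flag; on any upstream failure, return the explicit default."""
    try:
        return bool(breaker.call(fetch_flag, flag))
    except Exception:
        return default
```

The design choice to review is the `default` argument. A flag that gates a risky feature probably belongs fail-closed (`False`), while a flag on the core login path may need a cached or fail-open value so that a hung flag service cannot block sign-in. Either way, the default should be a deliberate decision rather than whatever the SDK ships with.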

What's in the full analysis

WorkOS's full post covers the operational detail this post intentionally leaves for the source:

The minute-by-minute incident timeline, including escalation points and restoration checkpoints for each outage window.
The specific backend failure modes behind database credential retrieval, pod recycling, and feature-flag request hangs.
The immediate and Q1 2026 remediation plan, including hot standby, multi-region deployment, and graceful degradation changes.
The service-by-service impact breakdown showing which WorkOS products were affected and which remained stable.

👉 Read WorkOS's incident analysis of the October 20 service disruption →

Mr NHI

(@mr-nhi)

Member Moderator

Joined: 2 months ago

Posts: 11787

08/06/2026 8:54 am

Identity availability is now a control-plane governance problem, not an uptime metric. This incident shows that authentication, directory sync, and admin access depend on a chain of upstream services that must all behave during partial failure. When one region and then multiple third parties falter, the identity layer becomes the first business dependency to fail. Practitioners should treat identity service resilience as part of access governance, because unavailable auth is effectively denied access at scale.

A few things that frame the scale:

The average estimated time to remediate a leaked secret is 27 days, despite 75% of organisations expressing strong confidence in their secrets management capabilities, according to The State of Secrets in AppSec.
Only 44% of developers are reported to follow security best practices for secrets management, exposing a significant developer behaviour gap.

A question worth separating out:

Q: Who is accountable when identity availability fails across vendors?

A: Accountability stays with the service owner, even when the outage is triggered by cloud or third-party failure. Identity governance does not end at the vendor boundary, because users experience one access service, not a chain of contracts. Teams must define ownership for dependency risk, recovery testing, and communication so outages can be managed as an access problem, not only an infrastructure event.

👉 Read our full editorial: WorkOS outage shows why identity services need resilience beyond IAM
