WorkOS outage shows why identity services need resilience beyond IAM

By NHI Mgmt Group Editorial Team | Published 2025-10-23 | Domain: Breaches & Incidents | Source: WorkOS

TL;DR: A regional AWS failure and cascading third-party outages disrupted Single Sign-On, AuthKit, and related services. According to WorkOS, failure rates reached 100% for some services, and roughly 70% of requests failed at the peak of the second incident window. The lesson is that identity availability now depends on multi-layer resiliency, not just authentication correctness.

At a glance

What this is: WorkOS describes a two-stage outage in which AWS regional failure and third-party dependency failures reduced identity service availability across several products.

Why it matters: IAM teams should treat identity platforms as availability-critical infrastructure, because outages in auth, directory, and admin paths can halt access even when existing sessions stay live.

By the numbers:

AuthKit was heavily impacted with a failure rate of 100%.
Approximately 70% of requests to the affected services failed at the peak of the second incident window.
Single Sign-On was heavily impacted with a failure rate of 50%.

👉 Read WorkOS's incident analysis of the October 20 service disruption

Context

Identity services fail differently from ordinary application services because they sit on the path to every other control in the stack. When regional infrastructure, database credential retrieval, or feature-flag dependencies break, access workflows can stall even if the application itself is otherwise healthy. This incident is a reminder that IAM availability is a governance issue, not just an infrastructure one.

WorkOS’s post shows a familiar pattern for modern identity platforms: the first outage came from cloud-region disruption, while the second came from third-party dependency fragility and degraded observability. That combination matters to IAM, NHI, and platform teams because it exposes how much trust is placed in hidden control-plane dependencies that are rarely tested under failure.

The practical question is no longer whether authentication works in the happy path. It is whether identity products can continue to authenticate, degrade, and recover when regions, proxies, and upstream services fail at the same time. That is the availability model practitioners should be designing for.

Key questions

Q: What breaks when identity services depend on a single cloud region?

A: When identity services depend on a single cloud region, the login path can fail even if the rest of the application is intact. New connections, credential retrieval, and backend health checks may collapse together, which stops fresh authentication while leaving existing sessions partially unaffected. The result is a brittle access layer that cannot absorb regional outage without service interruption.

Q: Why do third-party dependencies make auth outages worse?

A: Third-party dependencies make auth outages worse because they expand the failure domain beyond the identity platform itself. If hosting, feature flags, or proxies use untested default behaviors, requests can hang instead of failing safely. That turns a single upstream incident into a prolonged authentication event and slows recovery because operators must diagnose multiple layers at once.

Q: How do teams know if identity failover is actually working?

A: Teams know failover is working when sign-in, hosted pages, and admin functions recover cleanly during controlled dependency loss, not just when diagrams show a secondary region exists. The test should include degraded observability, failed deployments, and upstream service unavailability. If the fallback path cannot execute under those conditions, it is not operational resilience.

Q: Who is accountable when identity availability fails across vendors?

A: Accountability stays with the service owner, even when the outage is triggered by cloud or third-party failure. Identity governance does not end at the vendor boundary, because users experience one access service, not a chain of contracts. Teams must define ownership for dependency risk, recovery testing, and communication so outages can be managed as an access problem, not only an infrastructure event.

Technical breakdown

Regional failure and IAM credential retrieval

The first outage began when AWS us-east-1 experienced a region-wide failure, which prevented WorkOS’s database proxy from retrieving database credentials. That broke new connections and caused backend health checks to fail, which triggered pod recycling that could not succeed. In identity systems, credential retrieval is not a side function. If the control plane that brokers database or secret access fails, the service may stay partially up while new sign-ins collapse.

Practical implication: separate identity availability from single-region credential and proxy dependencies.
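
One way to loosen that coupling is to keep serving the last successfully retrieved credential for a bounded period when the secret store cannot be reached. The sketch below is illustrative only and is not how WorkOS's database proxy works: CredentialCache, the fetch callable, and max_stale_seconds are hypothetical names, and a cached credential only helps while it remains valid, so the staleness window has to sit inside your rotation policy.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CachedSecret:
    value: str
    fetched_at: float


class CredentialCache:
    """Serve the last good credential for a bounded time when the secret store is down."""

    def __init__(self, fetch: Callable[[], str], max_stale_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch                  # hypothetical call into a secret store
        self._max_stale = max_stale_seconds  # assumption: shorter than the rotation interval
        self._clock = clock
        self._cached: Optional[CachedSecret] = None

    def get(self) -> str:
        try:
            value = self._fetch()
        except Exception:
            # Upstream unavailable: reuse the cached value only if it is recent enough.
            cached = self._cached
            if cached is not None and self._clock() - cached.fetched_at <= self._max_stale:
                return cached.value
            raise  # nothing usable cached, so surface the failure quickly
        self._cached = CachedSecret(value, self._clock())
        return value
```

The trade-off is deliberate: a short staleness window lets new connections keep opening through a brief control-plane outage, while an expired cache still fails fast instead of hiding the problem.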

Feature flags, SDK defaults, and hanging auth requests

The second outage was driven by an integration with a feature-flag provider whose default SDK behavior was not resilient to upstream unavailability. Instead of failing cleanly, requests hung and produced 504 responses, which broke AuthKit page rendering and sign-in flows. This is a common failure mode in identity systems that depend on hidden runtime decisions from third-party SDKs. A resilient auth stack needs deterministic fallback behavior when dependency state is unknown.

Practical implication: validate timeout, fallback, and circuit-breaker behavior for every identity-critical SDK.
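
A minimal sketch of deterministic fallback, assuming a hypothetical fetch_flag callable that stands in for whatever the feature-flag SDK exposes (no real vendor API is shown here). The wrapper bounds how long a flag lookup may take and returns a pre-agreed default when the provider is slow or unreachable, so the sign-in request never waits on the upstream.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable

# Small shared pool so a slow provider cannot tie up the request thread itself.
_executor = ThreadPoolExecutor(max_workers=4)


def evaluate_flag(fetch_flag: Callable[[str], bool], name: str,
                  default: bool, timeout_s: float = 0.2) -> bool:
    """Return the flag value, or `default` if the lookup is slow or fails."""
    future = _executor.submit(fetch_flag, name)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # The worker may still be blocked on the provider; we stop waiting for it.
        future.cancel()
        return default
    except Exception:
        # Any provider error resolves to the safe default instead of propagating.
        return default
```

The default should be chosen per flag: for an identity path, the safe value is usually the behavior that was live before the flag existed. A hung lookup still occupies a worker thread until it returns, so in practice this would be paired with a client-level network timeout as well.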

Why multi-region failover still failed

WorkOS attempted to redeploy AuthKit to a secondary region, but deployment failures and degraded observability delayed recovery and made the mitigation ineffective. That shows failover is not the same as resilience. If observability is impaired and deployment mechanics are brittle, the backup path can fail at the exact moment it is needed. Identity services need recovery paths that are already proven under live dependency loss, not only documented in architecture diagrams.

Practical implication: test multi-region recovery with production-like failure conditions and degraded monitoring.

NHI Mgmt Group analysis

Identity availability is now a control-plane governance problem, not an uptime metric. This incident shows that authentication, directory sync, and admin access depend on a chain of upstream services that must all behave during partial failure. When one region and then multiple third parties falter, the identity layer becomes the first business dependency to fail. Practitioners should treat identity service resilience as part of access governance, because unavailable auth is effectively denied access at scale.

Feature-flag dependency fragility is a named control gap, not just an infrastructure inconvenience. The failure here was not that a feature-flag provider existed, but that the integration relied on default SDK behavior that did not degrade safely when the provider became unavailable. That means the access path inherited hidden assumptions about upstream responsiveness. The implication is that runtime control dependencies for identity products must be assessed as part of service design, not discovered during incident response.

Regional failover without observability and deployment integrity is resilient on paper only. WorkOS had a mitigation path, but degraded observability and deployment failures delayed restoration and made the fallback ineffective. This is the gap practitioners often miss: hot standby, secondary regions, and recovery plans do not equal continuity if the repair path itself is fragile. IAM and platform teams should evaluate whether their recovery model can actually execute under concurrent provider failure.

Identity teams should stop assuming session continuity means service continuity. Existing sessions remained live while sign-in, hosted pages, and some backend functions failed, which split user experience into partially working and fully unavailable states. That distinction matters because many programmes track authentication success while ignoring the resilience of the surrounding identity experience. The practitioner conclusion is that auth session stability is only one layer of availability, not the whole control.

Cross-provider dependence creates availability debt that compounds during incidents. The outage combined cloud-region failure, hosting-provider disruption, and feature-flag instability, which turned one incident into a prolonged identity service event. This is the operational equivalent of concentration risk in identity architecture. Teams should evaluate where authentication, hosted sign-in, and recovery tooling all depend on the same failure domain before the next incident exposes it.

From our research:
The average estimated time to remediate a leaked secret is 27 days, despite 75% of organisations expressing strong confidence in their secrets management capabilities, according to The State of Secrets in AppSec.
Only 44% of developers are reported to follow security best practices for secrets management, exposing a significant developer behaviour gap.
That gap aligns with the Ultimate Guide to NHIs, The NHI Market, where identity sprawl and fragmented control make resilience harder to sustain.

What this signals

Identity service resilience is becoming a governance requirement, not an SRE enhancement. When auth fails, the business loses the ability to recover access quickly, which is why IAM teams should track region diversity, fallback logic, and recovery execution alongside policy design. NIST Cybersecurity Framework 2.0 remains a useful way to frame that resilience as protect, detect, respond, and recover.

Control-plane fragility creates hidden availability debt for both NHI and human access programmes. A proxy, feature-flag service, or hosted login layer can become the weakest point in the access chain if it is not tested under concurrent failure. The practical signal is simple: if recovery depends on a perfect operator path, the identity programme is more brittle than the diagrams suggest.

Multi-region design only matters if the runtime dependencies can survive the move. Teams should now evaluate whether their identity stack can degrade gracefully when upstream services are partially unavailable and whether their observability still works during failover. That is the point at which a resilience architecture becomes an operational control, not a promise.

For practitioners

Map every identity-critical dependency chain: Document the services behind sign-in, session refresh, hosted pages, credential retrieval, and admin functions so you can see which upstreams share the same failure domain. Include third-party hosting, feature flags, database proxies, and observability tools in the same map.
Test fallback behavior for identity SDKs: Review timeout settings, circuit breakers, and fail-closed versus fail-open defaults for any SDK used in authentication or hosted identity flows. Force controlled upstream unavailability in staging to verify that the application returns clean errors rather than hanging requests.
Validate multi-region recovery under degraded monitoring: Run failover exercises where observability is partially missing and deployment pipelines are slowed or broken, because that is when recovery logic is most likely to fail. Measure whether the identity service can restore login and session functions without operator guesswork.
Separate session continuity from sign-in availability: Track existing session health, refresh success, hosted page rendering, and new authentication entry as distinct service states. That separation shows whether users are truly able to regain access or only remain authenticated after prior login success.
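
The dependency-mapping step can start as something very small. The sketch below inverts a map of identity functions to their upstreams and reports every upstream shared by more than one function. The entries in DEPENDENCIES are invented for illustration and do not describe WorkOS's actual architecture; substitute your own inventory.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Hypothetical inventory: identity function -> upstream dependencies it needs.
DEPENDENCIES: Dict[str, Set[str]] = {
    "sign_in": {"aws:us-east-1", "db-proxy", "feature-flags"},
    "hosted_pages": {"hosting-provider", "feature-flags"},
    "session_refresh": {"aws:us-east-1", "db-proxy"},
    "admin_functions": {"aws:us-east-1", "db-proxy", "feature-flags"},
    "failover_deploy": {"hosting-provider", "observability"},
}


def shared_failure_domains(deps: Dict[str, Set[str]],
                           min_paths: int = 2) -> Dict[str, List[str]]:
    """Return each upstream whose loss would affect at least `min_paths` functions."""
    by_upstream: Dict[str, List[str]] = defaultdict(list)
    for function, upstreams in deps.items():
        for upstream in upstreams:
            by_upstream[upstream].append(function)
    return {
        upstream: sorted(functions)
        for upstream, functions in by_upstream.items()
        if len(functions) >= min_paths
    }


if __name__ == "__main__":
    for upstream, functions in sorted(shared_failure_domains(DEPENDENCIES).items()):
        print(f"{upstream}: {', '.join(functions)}")
```

In this made-up inventory, the hosting provider sits behind both hosted pages and the failover deployment path, which is the kind of overlap worth finding before an incident: the recovery tooling shares a failure domain with the thing it is meant to recover.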

Key takeaways

This outage shows that identity services are only as resilient as their weakest upstream dependency.
The scale of impact reached 100% failure on AuthKit and roughly 70% request failure at peak, which is enough to interrupt access flows at the business edge.
Teams should design for clean degradation, regional failover, and recovery under partial observability, because auth continuity is now an availability control.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

NIST CSF 2.0 and NIST Zero Trust (SP 800-207) set the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
NIST CSF 2.0	RC.RP-1	Recovery planning is central because the incident exposed brittle failover execution.
NIST Zero Trust (SP 800-207)	PR.AC-1	Auth availability depends on trustworthy access paths and resilient policy enforcement.
NIST CSF 2.0	PR.PT-5	Protective technology must handle cloud and third-party failures without cascading auth outage.

Introduce timeouts, circuit breakers, and fallback logic for identity-critical integrations.
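
As a rough illustration of the circuit-breaker part of that recommendation, the sketch below stops calling an upstream after repeated failures and serves a fallback until a cool-down has passed, then allows one trial call. CircuitBreaker, failure_threshold, and reset_after_s are hypothetical names and values chosen for the example, not settings taken from the WorkOS report.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Skip a failing upstream for a cool-down period instead of retrying every request."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._threshold = failure_threshold
        self._reset_after = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T], fallback: T) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                return fallback  # open: do not touch the upstream at all
            # Cool-down elapsed: fall through and allow a single trial call.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self._threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            return fallback
        # Success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A breaker complements a timeout rather than replacing it: the timeout bounds each individual call, while the breaker keeps a known-bad dependency off the authentication path until it shows signs of recovery.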

Key terms

Identity Availability: Identity availability is the ability of authentication, authorisation, and associated access services to remain usable during normal and degraded conditions. In practice, it includes sign-in, session refresh, hosted login, and recovery paths, not just whether a token can be issued when everything is healthy.
Failure Domain: A failure domain is the set of systems that can fail together because they share an upstream dependency, region, provider, or operational mechanism. In identity programmes, the important question is not only what the service depends on, but how many access paths collapse when one dependency is lost.
Graceful Degradation: Graceful degradation means a service continues to provide partial, predictable function when a dependency becomes unavailable. For identity systems, that might mean returning clean errors, preserving existing sessions, or falling back to cached state instead of hanging requests or breaking the login experience entirely.
Hot Standby: Hot standby is a recovery design in which a secondary environment is kept ready to take traffic with minimal delay. For identity services, the value depends on more than infrastructure presence. The standby path must also work when observability is impaired and deployment tooling is under stress.

Deepen your knowledge

Identity service resilience and dependency governance are core topics in our NHI Foundation Level course, the industry's only accredited NHI security programme. If you are building a similar control model for sign-in and hosted access services, it is worth exploring.

This post draws on content published by WorkOS: service disruption on October 20, 2025 and the response plan. Read the original.

NHIMG Editorial Note
Published by the NHIMG editorial team on 2025-10-23.
NHI Mgmt Group — the independent authority on Non-Human Identity, IAM, and Agentic AI security. nhimg.org