Health checks and readiness gaps: what IAM teams need to know

NHI Mgmt Group

(@nhi-mgmt-group)

Member Moderator

Joined: 1 year ago

Posts: 12212

Topic starter 11/06/2026 11:13 pm

TL;DR: According to Pomerium, stack-aware health checks lifted client query success from near zero to 99.9% during scaling, restarts, and rollouts. The checks report readiness across Kubernetes, Docker, and systemd instead of relying on basic process liveness. The deeper lesson is that availability controls fail when they measure uptime but not operational readiness.

NHIMG editorial — based on content published by Pomerium: Designing Smarter Health Checks for Pomerium

Questions worth separating out

Q: How should teams separate readiness from liveness in production services?

A: Teams should use liveness to answer whether a process is running and readiness to answer whether the service can safely receive traffic.
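
To make the split concrete, here is a minimal sketch of two probe handlers that answer different questions. The ServiceState fields, the /healthz and /readyz paths, and the function names are illustrative assumptions, not taken from Pomerium's code.

```python
from dataclasses import dataclass


@dataclass
class ServiceState:
    # Hypothetical flags; a real service would define its own readiness criteria.
    initialized: bool = False    # startup work has finished
    state_synced: bool = False   # internal state has converged


def liveness(state: ServiceState) -> bool:
    # Liveness only asks "is the process running?". Reaching this code answers it.
    return True


def readiness(state: ServiceState) -> bool:
    # Readiness asks "can this instance safely receive traffic right now?".
    return state.initialized and state.state_synced


def probe_status(path: str, state: ServiceState) -> int:
    # Map each probe path to an HTTP status code: 200 for pass, 503 for fail.
    if path == "/healthz":
        return 200 if liveness(state) else 503
    if path == "/readyz":
        return 200 if readiness(state) else 503
    return 404
```

A freshly started instance in this sketch passes /healthz immediately but returns 503 on /readyz until both flags are set, so an orchestrator keeps the process alive without routing traffic to it.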

Q: Why do basic health checks fail in stateful orchestration environments?

A: Basic checks fail because they usually measure only the local process, not whether the service has finished initialization or synchronized state across components.

Q: How can security teams know whether health checks are actually working?

A: The signal is whether routing decisions match the service’s real ability to handle requests during restarts, rollouts, and scaling events.
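
One rough way to quantify that signal is to compare what the probe said with what actually happened to requests in the same window. The sketch below assumes you can collect paired samples of (probe reported ready, request succeeded); the function name and sample format are hypothetical.

```python
from typing import Iterable, Tuple


def probe_mismatch_rate(samples: Iterable[Tuple[bool, bool]]) -> float:
    """Fraction of samples where the probe verdict disagreed with reality.

    Each sample is (probe_reported_ready, request_succeeded).
    """
    pairs = list(samples)
    if not pairs:
        return 0.0
    mismatches = sum(1 for ready, ok in pairs if ready != ok)
    return mismatches / len(pairs)
```

A rate that climbs during restarts, rollouts, or scaling events suggests the probe is measuring something other than the service's ability to handle requests.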

Practitioner guidance

Define readiness separately from liveness: Write explicit readiness criteria for the service path that receives traffic, then keep liveness limited to process survival and crash detection.
Map every subsystem to a local state signal: Have authentication, authorization, storage, and proxy layers report their own state instead of inferring health from one shared endpoint.
Align probe semantics across runtimes: Translate the same internal state model into Kubernetes probes, Docker health checks, and systemd notifications so operators do not get different answers from different runtimes.
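
The three points above can be sketched as one small state model with thin per-runtime translations. This is an illustration under stated assumptions, not Pomerium's pkg/health implementation: HealthRegistry, Status, and the three mapping functions are hypothetical names. The runtime conventions they target are standard: Kubernetes HTTP probes treat 2xx/3xx responses as success, Docker HEALTHCHECK treats exit code 0 as healthy and 1 as unhealthy, and systemd accepts READY=1 and STATUS=... notification strings.

```python
from enum import Enum
from typing import Dict, Iterable, List


class Status(Enum):
    STARTING = "starting"
    RUNNING = "running"
    FAILED = "failed"


class HealthRegistry:
    """Hypothetical in-process registry: each subsystem reports its own state."""

    def __init__(self, components: Iterable[str]) -> None:
        self._state: Dict[str, Status] = {name: Status.STARTING for name in components}

    def report(self, name: str, status: Status) -> None:
        if name not in self._state:
            raise KeyError(f"unknown component: {name}")
        self._state[name] = status

    def not_ready(self) -> List[str]:
        # Names of subsystems that are not currently RUNNING, sorted for stable output.
        return sorted(n for n, s in self._state.items() if s is not Status.RUNNING)

    def ready(self) -> bool:
        return not self.not_ready()


def kubernetes_http_status(reg: HealthRegistry) -> int:
    # Status code for an HTTP readiness probe endpoint.
    return 200 if reg.ready() else 503


def docker_exit_code(reg: HealthRegistry) -> int:
    # Exit code for a Docker HEALTHCHECK command.
    return 0 if reg.ready() else 1


def systemd_message(reg: HealthRegistry) -> str:
    # Notification string to send to systemd.
    if reg.ready():
        return "READY=1"
    return "STATUS=waiting for: " + ", ".join(reg.not_ready())
```

Because all three translations read the same registry, an operator looking at a Kubernetes probe, a Docker health status, or a systemd status line sees the same verdict expressed in that runtime's vocabulary.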

What's in the full article

Pomerium's full blog post covers the implementation detail this analysis intentionally leaves for the source:

The exact state-tracking patterns used to aggregate readiness across authentication, authorization, storage, and proxy components
The push versus pull design trade-offs behind the internal health model and why eventual consistency shaped the final choice
How Kubernetes, Docker, and systemd were each mapped to the service’s internal readiness semantics
The final implementation details in pkg/health for teams that want to study the code path directly

👉 Read Pomerium's analysis of smarter health checks for zero-trust proxies →

Mr NHI

(@mr-nhi)

Member Moderator

Joined: 2 months ago

Posts: 11787

12/06/2026 7:48 am

The real failure mode is readiness drift, not lost uptime. A proxy or control plane can be alive, reachable, and still incapable of safely handling traffic because its internal state has not fully converged. That is the governance problem this design pattern exposes: availability checks that only confirm process survival create a false sense of operational readiness. Practitioners should treat readiness drift as a first-class operational risk.

A few things that frame the scale:

Only 5.7% of organisations have full visibility into their service accounts, according to the Ultimate Guide to NHIs.
79% of organisations have experienced secrets leaks, and 77% of those incidents resulted in tangible damage.

A question worth separating out:

Q: What is the difference between Kubernetes probes and systemd readiness signals?

A: Kubernetes separates startup, readiness, and liveness probes, while systemd relies on notification messages such as READY=1 and watchdog keep-alives to describe service state. Both communicate lifecycle, but at different levels of granularity. Teams need an internal readiness model that can be mapped cleanly onto either runtime without changing what a healthy service state means.
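
For the systemd side, a minimal sketch of the notification mechanism follows. It assumes a unit configured with Type=notify, in which case systemd sets the NOTIFY_SOCKET environment variable and the service sends datagrams such as READY=1 to that Unix socket. The helper name sd_notify mirrors the libsystemd function, but this is a hand-written illustration rather than a binding to it.

```python
import os
import socket


def sd_notify(message: str) -> bool:
    """Send a notification string (e.g. "READY=1") to systemd.

    Returns False when NOTIFY_SOCKET is unset, meaning the process is
    not running under a systemd unit that expects notifications.
    """
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):
        # A leading "@" denotes a Linux abstract-namespace socket.
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.connect(path)
        sock.sendall(message.encode("utf-8"))
    return True
```

A service would call sd_notify("READY=1") only once its internal readiness model reports ready, which is the systemd counterpart of a Kubernetes readiness probe starting to pass.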

👉 Read our full editorial: Smarter health checks expose the readiness gap in zero trust proxies
