TL;DR: According to Pomerium, stack-aware health checks lifted client query success from near zero to 99.9% during scaling, restarts, and rollouts by checking readiness across Kubernetes, Docker, and systemd instead of relying on basic process liveness. The deeper lesson is that availability controls fail when they measure uptime but not operational readiness.
NHIMG editorial — based on content published by Pomerium: Designing Smarter Health Checks for Pomerium
Questions worth separating out
Q: How should teams separate readiness from liveness in production services?
A: Teams should use liveness to answer whether a process is running and readiness to answer whether the service can safely receive traffic.
Q: Why do basic health checks fail in stateful orchestration environments?
A: Basic checks fail because they usually measure only the local process, not whether the service has finished initialization or synchronized state across components.
Q: How can security teams know whether health checks are actually working?
A: The signal is whether routing decisions match the service’s real ability to handle requests during restarts, rollouts, and scaling events.
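One way to make that signal concrete is to compare what the health check reported with what actually happened to requests during a rollout. The following is a minimal sketch, not Pomerium's method: the `Observation` record and both function names are hypothetical, and it assumes each sample pairs the readiness answer given to the router with the outcome of a request sent at the same moment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One sample taken during a restart, rollout, or scaling event (hypothetical shape)."""
    reported_ready: bool  # what the health check told the router
    request_ok: bool      # whether a request sent at that moment succeeded


def readiness_mismatches(observations):
    """Count the two ways a readiness signal can disagree with reality."""
    # Reported ready, but the request failed: traffic was routed too early.
    false_ready = sum(1 for o in observations if o.reported_ready and not o.request_ok)
    # Reported not ready, but the request succeeded: capacity was withheld.
    false_unready = sum(1 for o in observations if not o.reported_ready and o.request_ok)
    return false_ready, false_unready


def routed_success_rate(observations):
    """Success rate of requests sent while the instance claimed to be ready."""
    routed = [o for o in observations if o.reported_ready]
    if not routed:
        return None  # nothing was routed, so there is no rate to report
    return sum(o.request_ok for o in routed) / len(routed)
```

A non-zero `false_ready` count is the more dangerous mismatch, because it means clients were sent to an instance that could not serve them.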
Practitioner guidance
- Define readiness separately from liveness: Write explicit readiness criteria for the service path that receives traffic, then keep liveness limited to process survival and crash detection.
- Map every subsystem to a local state signal: Have authentication, authorization, storage, and proxy layers report their own state instead of inferring health from one shared endpoint.
- Align probe semantics across runtimes: Translate the same internal state model into Kubernetes probes, Docker health checks, and systemd notifications so operators do not get different answers from different runtimes.
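The first two points above can be sketched as a small tracker that each subsystem reports into. This is an illustrative model only and is not taken from Pomerium's pkg/health: the `State` values, the `HealthTracker` class, and its method names are all assumptions made for the example.

```python
import threading
from enum import Enum


class State(Enum):
    """Hypothetical lifecycle states a subsystem can report."""
    STARTING = "starting"
    RUNNING = "running"
    TERMINATING = "terminating"
    ERROR = "error"


class HealthTracker:
    """Collects per-subsystem state and derives liveness and readiness from it."""

    def __init__(self, required):
        self._lock = threading.Lock()
        # Every required subsystem starts as STARTING until it reports otherwise.
        self._states = {name: State.STARTING for name in required}

    def report(self, subsystem, state):
        """Called by a subsystem to publish its own state."""
        with self._lock:
            if subsystem not in self._states:
                raise KeyError(f"unknown subsystem: {subsystem}")
            self._states[subsystem] = state

    def is_live(self):
        # Liveness is limited to process survival: being able to answer at all
        # is the signal, so subsystem state is deliberately not consulted.
        return True

    def is_ready(self):
        # Readiness requires every required subsystem to be RUNNING.
        with self._lock:
            return all(s is State.RUNNING for s in self._states.values())

    def not_ready(self):
        """Names of subsystems that are blocking readiness, for diagnostics."""
        with self._lock:
            return sorted(n for n, s in self._states.items() if s is not State.RUNNING)
```

In this sketch a process that has started but has not finished initializing is live and not ready, which is the distinction a single shared endpoint tends to blur.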
What's in the full article
Pomerium's full blog post covers the implementation detail this analysis intentionally leaves for the source:
- The exact state-tracking patterns used to aggregate readiness across authentication, authorization, storage, and proxy components
- The push versus pull design trade-offs behind the internal health model and why eventual consistency shaped the final choice
- How Kubernetes, Docker, and systemd were each mapped to the service’s internal readiness semantics
- The final implementation details in pkg/health for teams that want to study the code path directly
👉 Read Pomerium's analysis of smarter health checks for zero-trust proxies →
Health checks and readiness gaps: what IAM teams need to know
Explore further
The real failure mode is readiness drift, not lost uptime. A proxy or control plane can be alive and reachable yet still unable to handle traffic safely because its internal state has not fully converged. That is the governance problem this design pattern exposes: availability checks that confirm only process survival create a false sense of operational readiness. Practitioners should treat readiness drift as a first-class operational risk.
A few things that frame the scale:
- Only 5.7% of organisations have full visibility into their service accounts, according to the Ultimate Guide to NHIs.
- 79% of organisations have experienced secrets leaks, and 77% of those incidents resulted in tangible damage.
A question worth separating out:
Q: What is the difference between Kubernetes probes and systemd readiness signals?
A: Kubernetes separates startup, readiness, and liveness, while systemd relies on signals such as READY=1 and watchdog notifications to describe service state. They both communicate lifecycle, but they do so with different levels of granularity. Teams need an internal readiness model that can be mapped cleanly onto either runtime without changing the meaning of healthy service state.
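As a rough illustration of mapping one internal answer onto several runtimes, the sketch below translates a single boolean into each runtime's vocabulary. The function names are hypothetical and this is not Pomerium's implementation. It relies on three documented conventions: Kubernetes HTTP probes treat status codes from 200 to 399 as success, a Docker HEALTHCHECK command exits 0 for healthy and 1 for unhealthy, and systemd accepts newline-separated assignments such as READY=1 and STATUS=... as a datagram sent to the socket named in the NOTIFY_SOCKET environment variable.

```python
import os
import socket


def http_probe_status(ok):
    """HTTP status for a Kubernetes probe endpoint: 200 passes, 503 fails."""
    return 200 if ok else 503


def docker_exit_code(ok):
    """Exit code for a Docker HEALTHCHECK command: 0 healthy, 1 unhealthy."""
    return 0 if ok else 1


def systemd_message(ready, status):
    """Build an sd_notify payload; READY=1 is included only once ready."""
    lines = []
    if ready:
        lines.append("READY=1")
    lines.append(f"STATUS={status}")
    return "\n".join(lines)


def notify_systemd(message):
    """Send a payload to systemd; returns False when not running under systemd."""
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):
        # A leading '@' denotes a Linux abstract-namespace socket.
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode(), path)
    return True
```

Because all three outputs derive from the same input, an operator reading a Kubernetes probe result, a Docker health status, or `systemctl status` sees the same underlying meaning of healthy. The finer granularity Kubernetes offers (separate startup, readiness, and liveness probes) would be handled by exposing one endpoint per question rather than by changing the mapping.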
👉 Read our full editorial: Smarter health checks expose the readiness gap in zero trust proxies