Smarter health checks expose the readiness gap in zero-trust proxies

By NHI Mgmt Group Editorial Team | Published 2025-10-30 | Domain: Best Practices | Source: Pomerium

TL;DR: Stack-aware health checks lifted client query success from near zero to 99.9% during scaling, restarts, and rollouts by checking readiness across Kubernetes, Docker, and systemd instead of relying on basic process liveness, according to Pomerium. The deeper lesson is that availability controls fail when they measure uptime but not operational readiness.

At a glance

What this is: A design analysis of stack-aware health checks that distinguish process liveness from true readiness in a zero-trust proxy.

Why it matters: IAM and NHI programmes increasingly depend on orchestration signals that must reflect actual service state, not just container uptime or a passing probe.

By the numbers:

The design moved client query success from nearly 0% to 99.9% during horizontal scaling, application restarts, and rollouts.
At around 1,000 requests per second, downtime lasted roughly 30 seconds before the smarter health checks were added.

Context

Readiness is not the same as liveness. In the source post, Pomerium shows why a service can be technically up while still failing requests because its underlying components have not fully initialized. This is a common failure mode in stateful zero-trust control planes and orchestration-driven environments.

For identity and access teams, the lesson is broader than Kubernetes. Any programme that depends on health signals, deployment gates, or automated failover needs those signals to reflect the actual operating state of the system, especially where access decisions, session validation, or proxy enforcement depend on synchronized backend state.

Key questions

Q: How should teams separate readiness from liveness in production services?

A: Teams should use liveness to answer whether a process is running and readiness to answer whether the service can safely receive traffic. If initialization, configuration sync, or downstream dependencies are incomplete, the service should stay out of rotation even when it looks healthy to a basic process check. That avoids routing traffic into partially initialized systems.

Q: Why do basic health checks fail in stateful orchestration environments?

A: Basic checks fail because they usually measure only the local process, not whether the service has finished initialization or synchronized state across components. In stateful systems, a passing probe can hide stale configuration, incomplete startup, or missing dependencies. The result is traffic being sent to a system that is technically alive but not operationally ready.

Q: How can security teams know whether health checks are actually working?

A: The signal is whether routing decisions match the service’s real ability to handle requests during restarts, rollouts, and scaling events. If success rates collapse or traffic reaches half-initialized instances, the health model is too coarse. A working system should keep unsafe instances out of rotation until the control path is genuinely ready.
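One way to make that signal measurable is to compute client success rates inside the window of a restart, rollout, or scaling event. The sketch below is illustrative only: the `Sample` record, the function names, and the 0.999 default target (borrowed from the 99.9% figure quoted above) are assumptions, not anything taken from Pomerium's tooling.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Sample:
    """One observed client request (hypothetical record shape)."""
    timestamp: float  # seconds on any monotonic or epoch clock
    ok: bool          # True if the request succeeded


def success_rate(samples: Iterable[Sample], start: float, end: float) -> Optional[float]:
    """Fraction of successful requests with start <= timestamp < end.

    Returns None when no traffic was observed in the window, so that
    "no data" is never mistaken for "all good".
    """
    window = [s for s in samples if start <= s.timestamp < end]
    if not window:
        return None
    return sum(s.ok for s in window) / len(window)


def health_model_holds(samples: Iterable[Sample], start: float, end: float,
                       target: float = 0.999) -> bool:
    """True only if traffic was observed and met the target during the event."""
    rate = success_rate(samples, start, end)
    return rate is not None and rate >= target
```

Running this over the same window before and after a health-check change gives a like-for-like comparison of how well routing decisions track real readiness.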

Q: What is the difference between Kubernetes probes and systemd readiness signals?

A: Kubernetes separates startup, readiness, and liveness, while systemd relies on signals such as READY=1 and watchdog notifications to describe service state. They both communicate lifecycle, but they do so with different levels of granularity. Teams need an internal readiness model that can be mapped cleanly onto either runtime without changing the meaning of healthy service state.

Technical breakdown

Liveness probes vs readiness probes in orchestration

Liveness probes answer whether a process is still functioning, while readiness probes answer whether it should receive traffic. In Kubernetes, that distinction matters because a container can be alive, responding on its port, and still be unsafe to route to if internal dependencies, caches, or synchronization steps are incomplete. Pomerium’s problem was that a generic health endpoint only measured survival, not operational suitability. That gap is common in distributed systems that mix startup latency, configuration propagation, and request-time state checks.

Practical implication: model health around service readiness rather than process existence, and separate process survival checks from traffic eligibility checks so orchestration does not route traffic to half-initialized services.
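The distinction can be sketched as two separate predicates over the same service state. This is a minimal illustration, not Pomerium's implementation; the `ServiceState` fields and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ServiceState:
    """Hypothetical snapshot of what a service knows about itself."""
    process_running: bool = True
    initialized: bool = False
    config_synced: bool = False
    # Dependency name -> whether that dependency is currently usable.
    dependencies_ready: Dict[str, bool] = field(default_factory=dict)


def is_live(state: ServiceState) -> bool:
    """Liveness: is the process still functioning? Failing this means restart."""
    return state.process_running


def is_ready(state: ServiceState) -> bool:
    """Readiness: is it safe to route traffic here right now?

    Note: with no dependencies registered, all() is vacuously True, so
    readiness then depends only on initialization and config sync.
    """
    return (
        state.process_running
        and state.initialized
        and state.config_synced
        and all(state.dependencies_ready.values())
    )
```

A freshly started instance is live but not ready under this model, which is exactly the window in which a survival-only endpoint would have let traffic through.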

Push and pull health state collection

Health state can be gathered by pulling state on demand or by pushing state updates into an aggregator. Pull-based checks are simple and localized because each probe executes its own logic when queried. Push-based checks centralize state and can enforce stricter ordering, but they also depend on every subsystem reporting accurately and in sequence. Pomerium found that eventual consistency and cross-component state synchronization made pure pull checks too limited, while push aggregation better matched its architecture.

Practical implication: choose the health-state collection model that matches the service’s real consistency pattern and failure surface, not the one that is easiest to sketch on a whiteboard.
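The two collection models can be contrasted in a few lines. Both classes below are hypothetical sketches written for this article, and the subsystem names passed to them are placeholders; they show the shape of each model rather than any real aggregator.

```python
import threading
from typing import Callable, Dict, Iterable


class PullChecker:
    """Pull model: every query runs each subsystem's check on demand."""

    def __init__(self, checks: Dict[str, Callable[[], bool]]):
        self._checks = checks

    def ready(self) -> bool:
        return all(check() for check in self._checks.values())


class PushAggregator:
    """Push model: subsystems report state changes; queries read stored state."""

    def __init__(self, expected: Iterable[str]):
        self._lock = threading.Lock()
        # Every expected subsystem starts as not ready until it reports in.
        self._states = {name: False for name in expected}

    def report(self, name: str, ok: bool) -> None:
        with self._lock:
            if name not in self._states:
                raise KeyError(f"unexpected subsystem: {name}")
            self._states[name] = ok

    def ready(self) -> bool:
        with self._lock:
            return all(self._states.values())
```

The push variant makes the dependency noted above explicit: a subsystem that never reports keeps the whole service out of rotation, which is safe but only as accurate as the reporting itself.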

Stack-aware health checks across Kubernetes, Docker, and systemd

Different runtimes expose health differently, so a single generic probe often hides important lifecycle detail. Kubernetes distinguishes startup, readiness, and liveness. Docker exposes a single HEALTHCHECK result, which is less expressive and pushes more responsibility into the application. systemd uses sd_notify messages such as READY=1 to communicate initialization and ongoing status. Pomerium’s design therefore had to normalize these runtime differences into a single internal state model while still preserving lifecycle nuance.

Practical implication: design health semantics once and normalise them centrally, then map them carefully to each orchestration runtime’s lifecycle signals rather than assuming platform-native probes mean the same thing.
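A single internal state can be translated into each runtime's vocabulary with small adapter functions. The `Health` enum and the function names below are assumptions made for illustration. The runtime conventions they target are standard: Kubernetes HTTP probes treat status codes from 200 up to but not including 400 as success, a Docker HEALTHCHECK command exits 0 for healthy and 1 for unhealthy, and systemd accepts newline-separated assignments such as READY=1 and STATUS=... over its notification socket.

```python
from enum import Enum


class Health(Enum):
    """Hypothetical internal state model shared by all runtimes."""
    STARTING = "starting"    # still initializing
    READY = "ready"          # safe to receive traffic
    NOT_READY = "not_ready"  # alive, but should be out of rotation
    FAILED = "failed"        # should be restarted


# Which internal states let each Kubernetes probe type pass.
_K8S_PASSING = {
    "startup": {Health.READY, Health.NOT_READY},
    "readiness": {Health.READY},
    "liveness": {Health.STARTING, Health.READY, Health.NOT_READY},
}


def kubernetes_probe_status(state: Health, probe: str) -> int:
    """HTTP status code to serve for a given probe type."""
    return 200 if state in _K8S_PASSING[probe] else 503


def docker_exit_code(state: Health) -> int:
    """Exit code for a HEALTHCHECK command: one result must cover everything."""
    return 0 if state is Health.READY else 1


def systemd_notify_message(state: Health) -> str:
    """Payload for the systemd notification socket (building only, no sending)."""
    if state is Health.READY:
        return "READY=1\nSTATUS=ready"
    return f"STATUS={state.value}"
```

The Docker adapter shows the loss of expressiveness directly: four internal states collapse into two exit codes, so the application has to decide which states count as healthy.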

NHI Mgmt Group analysis

Readiness drift is the real failure mode, not service uptime. A proxy or control plane can be alive, reachable, and still incapable of safely handling traffic because its internal state has not fully converged. That is the governance problem this design pattern exposes: availability checks that only confirm process survival create a false sense of operational readiness. Practitioners should treat readiness drift as a first-class operational risk.

State-local health reporting beats global health assumptions. When each subsystem reports its own state, failures stay visible and traceable instead of being collapsed into one opaque pass or fail result. That is especially relevant in systems with authentication, authorization, storage, and proxy layers that can each be unhealthy for different reasons. Practitioners should avoid aggregating away the signals that explain why a service is not safe to serve.

Cross-runtime health semantics need a single governance model. Kubernetes, Docker, and systemd all describe lifecycle differently, but the operator needs one internal definition of what safe readiness means. Without that, teams end up with inconsistent probe behavior, uneven rollback decisions, and unreliable failover. Practitioners should standardize the state model first and let the runtime-specific probes inherit from it.

Stack-aware health checks create identity-adjacent control integrity. In zero-trust proxies, health is not just an infrastructure concern because traffic routing, session validation, and authorization enforcement all depend on synchronized service state. If the proxy is serving before those layers are aligned, identity assurance breaks at the point of enforcement. Practitioners should align health logic with the control path, not the container lifecycle alone.

From our research:
Only 5.7% of organisations have full visibility into their service accounts, according to the Ultimate Guide to NHIs.
79% of organisations have experienced secrets leaks, and 77% of those incidents resulted in tangible damage.
For a broader governance lens, the Ultimate Guide to NHIs shows that 97% of NHIs carry excessive privileges.

What this signals

Readiness logic is becoming part of identity-adjacent governance. As more enforcement points depend on synchronized service state, teams need to treat probe semantics as an operational control, not just a deployment convenience. The next failure mode is not simply an outage. It is a control path that routes or authorizes before the underlying service state has converged.

The governance pattern here is a useful reminder for zero-trust programmes. If health signals are too coarse, they can create the same false confidence that weak access reviews create in identity programmes: the system looks governed, but the underlying state is not yet safe.

Teams that operate across Kubernetes, Docker, and systemd should expect more pressure to standardize lifecycle semantics, especially where proxying, authorization, or session handling depends on backend convergence. That is the point at which health checking becomes part of service assurance rather than just monitoring.

For practitioners

Define readiness separately from liveness: Write explicit readiness criteria for the service path that receives traffic, then keep liveness limited to process survival and crash detection. If a service can be alive but still unsafe to route to, readiness must fail until the state is fully converged.
Map every subsystem to a local state signal: Have authentication, authorization, storage, and proxy layers report their own state instead of inferring health from one shared endpoint. This makes failed initialization, stale configuration, and partial sync visible before traffic is accepted.
Align probe semantics across runtimes: Translate the same internal state model into Kubernetes probes, Docker health checks, and systemd notifications so operators do not get different answers from different runtimes. Use one source of truth for health, then adapt the output to each platform.

Key takeaways

The core issue is not whether a service is alive, but whether it is ready to handle traffic without breaking the control path.
Pomerium’s experience shows that coarse health checks can hide initialization and synchronization gaps that collapse request success rates during rollout events.
Teams should define readiness centrally, then map it to orchestration-specific signals so traffic only reaches fully converged services.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

NIST CSF 2.0 and NIST Zero Trust (SP 800-207) set the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
NIST CSF 2.0	PR.IP-4	Health checks support secure change and state validation during deployment.
NIST Zero Trust (SP 800-207)	PR.AC-1	Zero trust enforcement depends on accurate service state before access is granted.
NIST CSF 2.0	DE.CM-1	Continuous monitoring should distinguish operational readiness from mere process uptime.

Tie readiness checks to deployment gates so incomplete services never receive production traffic.
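A deployment gate of this kind can be as simple as requiring several consecutive passing readiness checks before a rollout step proceeds. The function below is a hedged sketch; its name, parameters, and defaults are assumptions rather than settings from any particular deployment tool.

```python
import time
from typing import Callable


def wait_until_ready(
    check: Callable[[], bool],
    required_successes: int = 3,
    max_attempts: int = 30,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once `check` passes `required_successes` times in a row.

    A single failure resets the streak, so an instance that flaps between
    ready and not ready never opens the gate. Returns False if the streak
    is not reached within `max_attempts` checks.
    """
    streak = 0
    for attempt in range(max_attempts):
        if check():
            streak += 1
            if streak >= required_successes:
                return True
        else:
            streak = 0
        # Skip the final sleep: there is no further attempt to wait for.
        if attempt < max_attempts - 1:
            sleep(interval_s)
    return False
```

Requiring a streak rather than one passing check is a deliberate choice here: it trades a few seconds of rollout time for protection against instances that report ready before their state has settled.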

Key terms

Readiness probe: A readiness probe checks whether a service can safely receive traffic, not merely whether it is running. In distributed systems, it should reflect initialization, dependency sync, and control-path integrity so orchestration does not route requests into a partially prepared instance.
Liveness probe: A liveness probe checks whether a process is still functioning and should be restarted if it is not. It is useful for crash detection, but it does not prove the service is ready to serve requests or that its internal state is consistent.
State convergence: State convergence is the point at which the service’s internal components agree on the current operational state. In health-check design, convergence matters because traffic should only flow when configuration, synchronization, and runtime status are aligned enough to support correct behaviour.
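As a rough illustration of state convergence, the sketch below treats a service as converged only when every expected component has reported the same configuration version. Using a version number as the convergence signal is an assumption made for this example, as are the function and component names; real systems may need richer criteria.

```python
from typing import Dict, Hashable, Iterable


def converged(reported_versions: Dict[str, Hashable],
              expected_components: Iterable[str]) -> bool:
    """True if every expected component reported the same version.

    A component that has not reported yet counts as not converged, and an
    empty component list is treated as not converged rather than trivially so.
    """
    expected = list(expected_components)
    if any(name not in reported_versions for name in expected):
        return False
    versions = {reported_versions[name] for name in expected}
    return len(versions) == 1
```

A readiness check built on this would stay failing while any component lags behind or has not yet reported.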

This post draws on content published by Pomerium: Designing Smarter Health Checks for Pomerium. Read the original.

NHIMG Editorial Note
Published by the NHIMG editorial team on 2025-10-30.
NHI Mgmt Group — the independent authority on Non-Human Identity, IAM, and Agentic AI security. nhimg.org