Kubernetes health checks expose the limits of eventual consistency

By NHI Mgmt Group Editorial Team
Published 2025-09-09
Domain: Best Practices
Source: Pomerium

TL;DR: Kubernetes health checks use startup, readiness, and liveness probes to manage containers, but Pomerium argues that eventual consistency, stateful dependencies, and split-mode deployments make those signals harder to trust in practice. The governance lesson is that health checks are only as reliable as the assumptions behind them, especially when access, proxying, and observability overlap.

At a glance

What this is: Pomerium’s analysis of why Kubernetes health checks are difficult to configure well and how probe design affects reliability, observability, and state handling.

Why it matters: Control-plane health signals often decide whether access flows continue, fail closed, or restart, so IAM and security teams depend on them for both NHI and broader platform governance.

👉 Read Pomerium's analysis of Kubernetes health checks and service readiness

Context

Kubernetes health checks are the mechanism that tells the orchestrator whether a workload should receive traffic, be restarted, or keep running. In practice, those checks become unreliable when they are asked to represent complex internal states with a small set of external signals, especially in systems that depend on authentication, authorization, and proxy behaviour.

For identity and security teams, the important question is not whether health checks exist but whether they align with the service lifecycle, dependency chain, and failure mode of the workload being governed. That makes them relevant to operational IAM, workload identity, and zero-trust enforcement because the same probe can either preserve resilience or mask a control failure.

Key questions

Q: How should teams design Kubernetes health checks for stateful services?

A: Teams should map probes to the service state they actually govern, not to a generic application heartbeat. Statefulness means the service can be partly healthy, partly degraded, or temporarily unavailable during transition. Effective probe design distinguishes those states and avoids treating one responding endpoint as proof that the whole workload is safe to serve.

Q: Why do Kubernetes health checks fail in complex deployments?

A: They fail when the probe is too narrow for the real dependency chain. A pod can answer the health endpoint while authentication, authorization, cache coherence, or downstream connectivity is already broken. In complex deployments, the problem is not the existence of probes but the mismatch between probe scope and actual service behaviour.

Q: How can operators tell whether a health check is actually useful?

A: A useful health check triggers the action a well-informed operator would have taken anyway. If a failed probe consistently identifies the right failure mode, triggers the right recovery response, and correlates with observability data, it is useful. If it restarts healthy services or misses real outages, it is measuring the wrong thing and should be redesigned.
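One way to make "useful" measurable is to compare probe failures against confirmed outages over the same time windows. The sketch below is a minimal illustration, assuming you can label each window after the fact; the function name `probe_quality` and the tuple layout are hypothetical, not something from Pomerium's post or a Kubernetes API.

```python
def probe_quality(observations):
    """Score a probe against incident history.

    observations: iterable of (probe_failed, real_outage) booleans,
    one pair per time window (hypothetical labelling, done after the fact).
    """
    true_pos = false_pos = false_neg = 0
    for probe_failed, real_outage in observations:
        if probe_failed and real_outage:
            true_pos += 1          # probe fired and the outage was real
        elif probe_failed:
            false_pos += 1         # probe fired on a healthy service
        elif real_outage:
            false_neg += 1         # outage the probe never saw
    flagged = true_pos + false_pos
    outages = true_pos + false_neg
    return {
        # Share of probe failures that were real outages.
        "precision": true_pos / flagged if flagged else None,
        # Share of real outages that the probe caught.
        "recall": true_pos / outages if outages else None,
        "needless_actions": false_pos,
        "missed_outages": false_neg,
    }


history = [(True, True), (True, True), (True, True),
           (True, False), (False, True), (False, False)]
print(probe_quality(history))
# {'precision': 0.75, 'recall': 0.75, 'needless_actions': 1, 'missed_outages': 1}
```

Low precision points to a probe that restarts or drains healthy services; low recall points to a probe whose scope is too narrow for the real failure modes.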

Q: What should security teams do when readiness signals are unreliable?

A: They should treat readiness as an access decision and verify whether the signal is trustworthy enough to gate traffic. When readiness is unreliable, the safer approach is to add richer diagnostic signals and reduce the assumption that a single green check means the workload is fit to receive requests.

Technical breakdown

Startup, readiness, and liveness probes

Kubernetes uses three probe types to represent different phases of application health. Startup probes cover initialization, readiness probes decide whether traffic can flow, and liveness probes decide whether a container should be restarted. The model is simple for stateless workloads, but it becomes harder when internal subsystems fail independently, when cache state matters, or when one component can still answer while another is degraded. That is why a single green check can conceal partial failure rather than prove service health.

Practical implication: Treat each probe as a governance decision, not a generic uptime signal, and map it to the exact service state it is meant to represent.
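The separation of the three probe types can be sketched as three handlers that each read a different part of the service state. This is a minimal illustration under assumptions: `ServiceState`, its fields, and the dependency names are hypothetical, and a real service would expose these functions behind HTTP endpoints referenced by its probe configuration.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceState:
    """Hypothetical in-process state that the three probe handlers read."""
    initialised: bool = False                          # set once boot work completes
    dependencies: dict = field(default_factory=dict)   # name -> reachable?
    last_heartbeat: float = 0.0                        # updated by the main work loop
    heartbeat_timeout: float = 10.0                    # seconds before the loop counts as hung


def startup_status(state: ServiceState) -> int:
    """Startup probe: has initialisation finished? Nothing else."""
    return 200 if state.initialised else 503


def readiness_status(state: ServiceState) -> int:
    """Readiness probe: is it safe to send traffic right now?"""
    if not state.initialised:
        return 503
    return 200 if all(state.dependencies.values()) else 503


def liveness_status(state: ServiceState, now: float) -> int:
    """Liveness probe: is the process itself stuck?

    Dependencies are deliberately ignored here, so a downstream outage
    removes the workload from traffic without causing a restart loop.
    """
    fresh = now - state.last_heartbeat <= state.heartbeat_timeout
    return 200 if fresh else 503


state = ServiceState(initialised=True,
                     dependencies={"authn": True, "authz": False},
                     last_heartbeat=100.0)
print(startup_status(state), readiness_status(state), liveness_status(state, now=105.0))
# 200 503 200
```

In this example the authorization dependency is down, so the workload stops receiving traffic but is not restarted, which is the partial-failure case a single combined check would hide.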

Eventual consistency and repeated polling

Kubernetes health checks are continuously polled and evaluated against failure thresholds. That means the system is not looking for a perfect snapshot, but for repeated evidence that a workload is not behaving as expected. This works well when state is stable, but it can produce false positives or delayed recovery when the application transitions through temporary states. In stateful or distributed services, the probe may report failure before the system has genuinely broken, or miss a failure because the wrong dependency is being checked.

Practical implication: Set probe thresholds and timing only after you understand the service’s normal state transitions and failure recovery pattern.
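The effect of repeated polling against thresholds can be sketched with a small simulator. It mirrors the idea behind the Kubernetes `failureThreshold` and `successThreshold` probe fields (consecutive results are needed before the reported state changes), but it is a simplified model written for illustration, not the kubelet's implementation; the function name `evaluate` is hypothetical.

```python
def evaluate(results, failure_threshold=3, success_threshold=1, initially_healthy=True):
    """Return the reported health after each poll.

    results: sequence of booleans, True when a single probe attempt passed.
    The reported state only flips after enough consecutive opposite results.
    """
    healthy = initially_healthy
    consecutive_fails = consecutive_passes = 0
    reported = []
    for passed in results:
        if passed:
            consecutive_passes += 1
            consecutive_fails = 0
            if not healthy and consecutive_passes >= success_threshold:
                healthy = True
        else:
            consecutive_fails += 1
            consecutive_passes = 0
            if healthy and consecutive_fails >= failure_threshold:
                healthy = False
        reported.append(healthy)
    return reported


# Two failures (a transient blip), a recovery, then three failures in a row.
polls = [True, False, False, True, False, False, False, True]
print(evaluate(polls))
# [True, True, True, True, True, True, False, True]
```

The two-poll blip never changes the reported state, while the three-poll failure does. A service whose normal transitions last longer than the threshold window would be misreported as failed, which is the false-positive case described above.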

Readiness gates, external signals, and observability

When built-in probes are too coarse, Kubernetes allows more context through readiness gates, custom probes, and external signals. That is especially useful when health depends on downstream authentication, proxy behaviour, or cross-service coordination. Pomerium also points to observability as the next layer, using metrics, logs, and traces to explain whether a failure originates in the service or outside it. The core point is that health checks alone are not a complete diagnostic model for complex platforms.

Practical implication: Combine probes with observability signals so operators can distinguish a true service failure from an upstream dependency issue.
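To illustrate pairing a probe result with diagnostic evidence, the sketch below builds a structured readiness report that records which checks failed and whether the failure is local or upstream. The `CheckResult` type, the `scope` labels, and the check names are assumptions made for this example; the report could be emitted as a log line or attached to a trace alongside the plain pass or fail answer.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    name: str
    ok: bool
    scope: str        # "local" or "upstream" (hypothetical labels)
    detail: str = ""


def readiness_report(results):
    """Summarise check results so a failure can be attributed, not just detected."""
    failed = [r for r in results if not r.ok]
    scopes = {r.scope for r in failed}
    if not scopes:
        origin = "none"
    elif len(scopes) == 1:
        origin = next(iter(scopes))
    else:
        origin = "mixed"
    return {
        "ready": not failed,
        "failure_origin": origin,
        "failed_checks": [
            {"name": r.name, "scope": r.scope, "detail": r.detail} for r in failed
        ],
    }


report = readiness_report([
    CheckResult("policy-cache", ok=True, scope="local"),
    CheckResult("identity-provider", ok=False, scope="upstream", detail="timeout"),
])
print(json.dumps(report, sort_keys=True))
```

Here the probe would still answer "not ready", but the accompanying record shows the fault sits upstream, so restarting the local service would not help.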

NHI Mgmt Group analysis

Health checks are a control boundary, not just an uptime feature. Kubernetes probes decide whether a workload should accept traffic, restart, or stay in service, so they function as an access and resilience control as much as an operational one. When the probe model is too shallow for the system state, governance decisions are made on partial information. Practitioners should treat probe design as part of service assurance, not an implementation detail.

The failure mode here is probe overconfidence. A system can appear healthy because the checked endpoint responds, while authentication, authorization, or proxy dependency chains are already degraded. That is especially dangerous in split-mode or stateful deployments where the visible service surface is not the whole service reality. The practical conclusion is that teams need to align probe scope with actual subsystem risk.

Service lifecycle thinking is the right frame for this problem. Created, starting, running, and terminating are different states, and Kubernetes does not infer those transitions for you. When teams collapse all of those states into one readiness signal, they lose the ability to express nuance in failure handling. Practitioners should make lifecycle state explicit wherever a workload can fail in more than one way.
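Making lifecycle state explicit can be as simple as a small state machine that each probe consults. The sketch below uses the four states named above; the allowed transitions and the mapping to probe answers are illustrative assumptions, and a real service may need different rules.

```python
from enum import Enum


class Lifecycle(Enum):
    CREATED = "created"
    STARTING = "starting"
    RUNNING = "running"
    TERMINATING = "terminating"


# Assumed legal transitions for this sketch.
ALLOWED = {
    Lifecycle.CREATED: {Lifecycle.STARTING, Lifecycle.TERMINATING},
    Lifecycle.STARTING: {Lifecycle.RUNNING, Lifecycle.TERMINATING},
    Lifecycle.RUNNING: {Lifecycle.TERMINATING},
    Lifecycle.TERMINATING: set(),
}

# (startup_ok, ready, alive) for each state.
PROBE_ANSWERS = {
    Lifecycle.CREATED: (False, False, True),
    Lifecycle.STARTING: (False, False, True),
    Lifecycle.RUNNING: (True, True, True),
    # Draining: stop receiving traffic, but do not ask for a restart.
    Lifecycle.TERMINATING: (True, False, True),
}


def transition(current: Lifecycle, target: Lifecycle) -> Lifecycle:
    """Move to the target state, rejecting transitions the model does not allow."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target


state = transition(Lifecycle.CREATED, Lifecycle.STARTING)
state = transition(state, Lifecycle.RUNNING)
print(state.value, PROBE_ANSWERS[state])
# running (True, True, True)
```

Because starting and terminating both answer "not ready" for different reasons, keeping the state explicit lets logs and dashboards say which one applied.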

Observability is the second control plane for health governance. Metrics, logs, and traces do not replace probes, but they stop health checks from becoming a binary illusion. That matters because the operator needs to know not just that something failed, but whether the issue sits in the app, the proxy layer, or a downstream dependency. Practitioners should pair health checks with diagnostic evidence before they rely on automated recovery.

This topic matters for identity governance because access continuity depends on trustworthy service state. In zero-trust environments, a workload that is not truly ready should not receive access, and a workload that is partially failed should not be treated as fully trusted. The governance implication is simple: if health signals are wrong, access decisions are wrong. Practitioners should tie health signalling to the control path that actually consumes it.
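The idea that a wrong health signal produces a wrong access decision can be sketched as a fail-closed gate: traffic is allowed only when a readiness signal exists, is recent, and is positive, and every decision carries a reason. The `ReadinessSignal` type, the `max_age_seconds` default, and the source name are hypothetical values chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ReadinessSignal:
    ready: bool
    observed_at: float   # seconds, on the same clock as `now`
    source: str          # which component asserted readiness


def may_receive_traffic(signal: Optional[ReadinessSignal], now: float,
                        max_age_seconds: float = 15.0) -> Tuple[bool, str]:
    """Fail closed: a missing, stale, or negative signal means no traffic."""
    if signal is None:
        return False, "no readiness signal"
    age = now - signal.observed_at
    if age > max_age_seconds:
        return False, f"signal from {signal.source} is stale ({age:.0f}s old)"
    if not signal.ready:
        return False, f"{signal.source} reports not ready"
    return True, f"{signal.source} reported ready {age:.0f}s ago"


sig = ReadinessSignal(ready=True, observed_at=100.0, source="proxy-readyz")
print(may_receive_traffic(sig, now=105.0))
print(may_receive_traffic(sig, now=200.0))
```

Returning the reason alongside the decision gives the control path that consumes the signal an audit trail for why a workload was trusted at that moment.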

From our research:
91.6% of secrets remain valid five days after the targeted organisation is notified, showing a critical gap in remediation procedures, according to the Ultimate Guide to NHIs.
Only 5.7% of organisations have full visibility into their service accounts, which is why probe reliability and dependency mapping cannot be treated as separate operational concerns.
For the broader control picture, see the Ultimate Guide to NHIs for visibility, rotation, and offboarding patterns that make health signalling more trustworthy.

What this signals

Service lifecycle: Kubernetes health checks only work when teams can distinguish created, starting, running, and terminating states. If those transitions are collapsed into a single readiness outcome, operators get confidence without clarity, and automated recovery starts acting on incomplete evidence.

The next programme-level shift is to connect probe design with identity governance, especially where proxy services and workload identity are part of the access path. That means health checks, observability, and access control need to be evaluated as one chain rather than three separate tools.

If your environment depends on stateful services, readiness gates and diagnostic telemetry become part of operational trust. The practical signal is simple: any service that can change access outcomes should also have a traceable explanation for why it was considered healthy.

For practitioners

Define probe scope by subsystem: Map startup, readiness, and liveness checks to the exact subsystem they represent, such as authentication, authorization, or proxying, so each probe answers one operational question only.
Instrument split-mode dependencies: Add explicit checks for cache invalidation, cross-component synchronisation, and any readiness gate that can block traffic until the service is genuinely usable.
Pair probes with traces and logs: Use metrics, logs, and traces to separate a local service fault from an upstream dependency issue before you automate restarts or traffic removal.
Review health logic after lifecycle changes: Reassess probe thresholds whenever deployment topology, state handling, or traffic routing changes, because a health rule that worked in one deployment mode can misclassify another.
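The review step in the last item can be partly automated. For a startup probe, the time a container has to finish booting is roughly `failureThreshold` multiplied by `periodSeconds`, plus any `initialDelaySeconds`; those three names are real Kubernetes probe fields. The sketch below checks that budget against observed startup times; the function names and the safety `margin` are assumptions for illustration.

```python
def startup_budget_seconds(period_seconds, failure_threshold, initial_delay_seconds=0):
    """Approximate time a container has to pass its startup probe."""
    return initial_delay_seconds + period_seconds * failure_threshold


def budget_covers(observed_startup_seconds, period_seconds, failure_threshold,
                  initial_delay_seconds=0, margin=1.5):
    """True when the budget exceeds the slowest observed start by the given margin.

    margin is a hypothetical safety factor, not a Kubernetes setting.
    """
    budget = startup_budget_seconds(period_seconds, failure_threshold,
                                    initial_delay_seconds)
    return budget >= max(observed_startup_seconds) * margin


# periodSeconds=10 and failureThreshold=30 give a 300 second budget.
print(startup_budget_seconds(10, 30))           # 300
print(budget_covers([42, 55, 120], 10, 30))     # True
# After a topology change, startup now takes 250 seconds in the worst case.
print(budget_covers([42, 55, 250], 10, 30))     # False
```

Running a check like this whenever deployment mode or state handling changes flags thresholds that were tuned for the previous behaviour.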

Key takeaways

Kubernetes health checks are governance controls as much as uptime signals, because they determine whether a workload receives traffic or gets restarted.
The main risk is probe overconfidence, where a service looks healthy even though a dependency chain, cache state, or access layer is already failing.
Teams should align probes with lifecycle state, add observability, and verify that health signals are trustworthy enough to drive access decisions.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

NIST CSF 2.0 and NIST Zero Trust (SP 800-207) set the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
NIST CSF 2.0	PR.PT-5	Health checks support resilient service behaviour and controlled recovery.
NIST Zero Trust (SP 800-207)		Readiness signals influence whether a workload is trusted to receive traffic.
NIST CSF 2.0	DE.CM-8	Observability is needed to distinguish service faults from upstream dependency issues.

Tie probe behaviour to service protection goals and verify restart logic against actual failure modes.

Key terms

Startup Probe: A startup probe is the Kubernetes check used to determine whether a container has finished initialization. It matters when boot time is long or variable, because the workload should not be judged against readiness or liveness before it is actually able to start correctly.
Readiness Probe: A readiness probe tells Kubernetes whether a workload should receive traffic. It is an access decision signal, not a proof of full health, so it must reflect the exact conditions that make the service safe and usable for real requests.
Liveness Probe: A liveness probe tells Kubernetes whether a container should be restarted. It is useful for detecting a deadlock or hung process, but it can create instability if it is too sensitive or if it mistakes temporary transition states for a true failure.
Readiness Gate: A readiness gate adds extra conditions that must be met before a workload becomes ready. It is useful when built-in probes are too coarse, because it lets operators include external signals such as dependency status, proxy health, or custom application logic.
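The readiness gate definition can be sketched as a predicate: a pod counts as ready only when every container is ready and every gate condition listed for it reports "True". This is a simplified model of that rule for illustration, not Kubernetes source code; the condition type `example.com/proxy-synced` and the container names are hypothetical.

```python
def pod_is_ready(containers_ready, readiness_gates, conditions):
    """Simplified pod readiness rule.

    containers_ready: dict of container name -> bool (built-in probe results)
    readiness_gates:  list of condition type names the pod declares
    conditions:       dict of condition type -> "True", "False" or "Unknown",
                      as set by an external controller
    """
    if not all(containers_ready.values()):
        return False
    # A gate whose condition has not been reported yet counts as not satisfied.
    return all(conditions.get(gate) == "True" for gate in readiness_gates)


containers = {"proxy": True, "sidecar": True}
gates = ["example.com/proxy-synced"]

print(pod_is_ready(containers, gates, {}))                                    # False
print(pod_is_ready(containers, gates, {"example.com/proxy-synced": "True"}))  # True
```

The first call shows why a gate is stricter than the built-in probes alone: both containers pass, but the pod stays out of service until the external signal arrives.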

This post draws on content published by Pomerium: 7 Things to Know About Kubernetes Health Checks. Read the original.

NHIMG Editorial Note
Published by the NHIMG editorial team on 2025-09-09.
NHI Mgmt Group — the independent authority on Non-Human Identity, IAM, and Agentic AI security. nhimg.org