Kubernetes health checks: where reliability and observability break down

NHI Mgmt Group

(@nhi-mgmt-group)

Member Moderator

Joined: 1 year ago

Posts: 12212

Topic starter 11/06/2026 11:56 pm

TL;DR: Kubernetes health checks use startup, readiness, and liveness probes to manage containers, but Pomerium argues that eventual consistency, stateful dependencies, and split-mode deployments make those signals harder to trust in practice. The governance lesson is that health checks are only as reliable as the assumptions behind them, especially when access, proxying, and observability overlap.

NHIMG editorial — based on content published by Pomerium: 7 Things to Know About Kubernetes Health Checks

Questions worth separating out

Q: How should teams design Kubernetes health checks for stateful services?

A: Teams should map probes to the service state they actually govern, not to a generic application heartbeat.

Q: Why do Kubernetes health checks fail in complex deployments?

A: They fail when the probe is too narrow for the real dependency chain.

Q: How can operators tell whether a health check is actually useful?

A: A useful health check predicts operator action.

Practitioner guidance

Define probe scope by subsystem: Map startup, readiness, and liveness checks to the exact subsystem they represent, such as authentication, authorization, or proxying, so each probe answers one operational question only.
Instrument split-mode dependencies: Add explicit checks for cache invalidation, cross-component synchronisation, and any readiness gate that can block traffic until the service is genuinely usable.
Pair probes with traces and logs: Use metrics, logs, and traces to separate a local service fault from an upstream dependency issue before you automate restarts or traffic removal.
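The first point can be sketched in code. This is a minimal illustration, not Pomerium's implementation: the ProbeRegistry class, the subsystem names, and the lambda checks are all hypothetical. It relies on one documented Kubernetes behaviour: an HTTP probe passes on any status code from 200 to 399 and fails otherwise.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple

Check = Callable[[], bool]


@dataclass
class ProbeRegistry:
    """Hypothetical helper: maps each probe kind to the subsystem checks it covers."""

    checks: Dict[str, Dict[str, Check]] = field(
        default_factory=lambda: {"startup": {}, "readiness": {}, "liveness": {}}
    )

    def register(self, probe: str, subsystem: str, check: Check) -> None:
        # Reject unknown probe kinds so a typo cannot silently drop a check.
        if probe not in self.checks:
            raise ValueError(f"unknown probe kind: {probe}")
        self.checks[probe][subsystem] = check

    def evaluate(self, probe: str) -> Tuple[int, Dict[str, bool]]:
        # Run every check registered for this probe and keep per-subsystem results,
        # so the response body says which subsystem failed, not just that one did.
        results = {name: bool(check()) for name, check in self.checks[probe].items()}
        status = 200 if all(results.values()) else 503
        return status, results


# Assumed wiring: each probe answers one operational question.
registry = ProbeRegistry()
registry.register("startup", "config_loaded", lambda: True)
registry.register("readiness", "authorization", lambda: True)
registry.register("readiness", "proxy", lambda: False)  # simulated proxy fault
registry.register("liveness", "event_loop", lambda: True)

status, detail = registry.evaluate("readiness")
print(status, detail)
```

Keeping the per-subsystem results in the response means a failing readiness probe names the subsystem at fault, while liveness stays green and the container is not restarted for a problem a restart would not fix.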

What's in the full article

Pomerium's full blog post covers the operational detail this post intentionally leaves for the source:

The team’s exact reasoning for using startup, readiness, and liveness probes in different service states.
How Pomerium applies health checks across authentication, authorization, and Envoy proxying components.
The split-mode and cache invalidation scenarios that make readiness harder to model in practice.
The observability guidance behind metrics, logs, traces, and OpenTelemetry for diagnosing unhealthy states.

👉 Read Pomerium's analysis of Kubernetes health checks and service readiness →


Mr NHI

(@mr-nhi)

Member Moderator

Joined: 2 months ago

Posts: 11787

12/06/2026 8:52 am

Health checks are a control boundary, not just an uptime feature. Kubernetes probes decide whether a workload should accept traffic, restart, or stay in service, so they function as an access and resilience control as much as an operational one. When the probe model is too shallow for the system state, governance decisions are made on partial information. Practitioners should treat probe design as part of service assurance, not an implementation detail.

A few things that frame the scale:

91.6% of secrets remain valid five days after the targeted organisation is notified, showing a critical gap in remediation procedures, according to the Ultimate Guide to NHIs.
Only 5.7% of organisations have full visibility into their service accounts, a visibility gap that makes it hard to treat probe reliability and dependency mapping as separate operational concerns.

A question worth separating out:

Q: What should security teams do when readiness signals are unreliable?

A: They should treat readiness as an access decision and verify whether the signal is trustworthy enough to gate traffic. When readiness is unreliable, the safer approach is to add richer diagnostic signals and stop assuming that a single green check means the workload is fit to receive requests.
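One way to picture "more than a single green check" is a readiness decision that combines several signals and records why it said no. The sketch below is illustrative only: decide_readiness, the signal names, and the 30-second staleness limit are assumptions, with configuration age standing in for the eventual-consistency lag discussed in the thread.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ReadinessDecision:
    ready: bool
    reasons: List[str]  # empty when ready; otherwise why traffic is withheld


def decide_readiness(
    signals: Dict[str, bool],
    config_age_seconds: float,
    max_config_age_seconds: float = 30.0,  # assumed limit, tune per service
) -> ReadinessDecision:
    """Hypothetical gate: ready only if every signal passes and config is fresh."""
    # Collect one reason per failing signal, sorted for stable output.
    reasons = [
        f"{name} check failing" for name, ok in sorted(signals.items()) if not ok
    ]
    # Stale configuration counts against readiness even if every check is green.
    if config_age_seconds > max_config_age_seconds:
        reasons.append(
            f"config is {config_age_seconds:.0f}s old "
            f"(limit {max_config_age_seconds:.0f}s)"
        )
    return ReadinessDecision(ready=not reasons, reasons=reasons)


decision = decide_readiness({"authn": True, "proxy": True}, config_age_seconds=45.0)
print(decision.ready, decision.reasons)
```

Here both checks are green but the decision is still "not ready", and the recorded reason gives operators something to correlate with logs and traces before traffic is removed.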

👉 Read our full editorial: Kubernetes health checks expose the limits of eventual consistency
