Why do observability tools fail to prevent lateral movement in workloads?

Observability tools fail because they can show scanning, token use, and suspicious connections without stopping them. Lateral movement is prevented only when the runtime can deny access to internal services or databases. If the secret is still valid, the log entry arrives too late to change the outcome.

Why This Matters for Security Teams

Observability is essential for detection, but it is not a prevention control. In workloads, lateral movement often begins with a valid secret, a service token, or an over-permissive workload identity that can already reach internal APIs, databases, or admin endpoints. Once that access exists, logs can confirm the path of abuse, but they cannot stop the first successful hop. That is why runtime authorisation and short-lived identity matter more than post-event visibility.

This gap shows up constantly in machine identity programs. In the State of Secrets in AppSec, NHIMG notes that organisations maintain an average of 6 distinct secrets manager instances, which creates fragmentation that weakens centralised control. The problem is not just missing telemetry. It is that observability tools sit after the decision point, while the attacker or compromised workload is acting in real time. Current guidance from the SPIFFE workload identity specification reinforces this distinction by focusing on cryptographic workload identity, not passive inspection alone.

In practice, many security teams discover lateral movement only after a service account has already authenticated successfully to the next system.

How It Works in Practice

Stopping lateral movement requires controls that evaluate every request before access is granted. For autonomous workloads, that usually means pairing workload identity with runtime policy. The workload proves what it is through cryptographic identity, then the platform decides whether the specific action is allowed at that moment. This is a different model from observability, which records what already happened.
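
The difference in decision point can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not a real product integration: `Request`, `query_database`, `observed`, `enforced`, `is_allowed`, and the `ALLOWED` set are hypothetical names introduced here. The observing wrapper records the call and lets it proceed, while the enforcing wrapper decides before the handler runs.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Request:
    workload: str  # hypothetical identity of the calling workload
    target: str    # hypothetical name of the internal service being called


audit_log: List[str] = []


def query_database(req: Request) -> str:
    """Stand-in for an internal service that returns data to the caller."""
    return f"rows returned to {req.workload}"


def observed(handler: Callable[[Request], str]) -> Callable[[Request], str]:
    """Observability only: log the request, then let it through."""
    def wrapper(req: Request) -> str:
        audit_log.append(f"{req.workload} -> {req.target}")
        return handler(req)
    return wrapper


def enforced(handler: Callable[[Request], str],
             is_allowed: Callable[[Request], bool]) -> Callable[[Request], str]:
    """Enforcement: evaluate the request before the handler ever runs."""
    def wrapper(req: Request) -> str:
        if not is_allowed(req):
            audit_log.append(f"DENIED {req.workload} -> {req.target}")
            raise PermissionError(f"{req.workload} may not reach {req.target}")
        return handler(req)
    return wrapper


# Assumed allowlist for the sketch; anything not listed is denied.
ALLOWED = {("billing-api", "orders-db")}


def is_allowed(req: Request) -> bool:
    return (req.workload, req.target) in ALLOWED
```

With the observing wrapper, a request from an unexpected workload still returns data and only leaves a log line behind. With the enforcing wrapper, the same request raises `PermissionError` before the database stand-in is called.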

A practical design often includes:

  • Workload identity issued per service, task, or pod, rather than shared across environments, using patterns described in the Guide to SPIFFE and SPIRE.

  • Short-lived credentials or tokens with narrow scope, so compromise does not yield durable reuse.

  • Policy evaluation at request time, using policy-as-code to decide whether a workload can query a database, call an internal API, or assume another role.

  • Network and service controls that deny by default, instead of merely alerting when unexpected east-west traffic occurs.
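
The list above can be sketched as a small request-time check. This is a minimal illustration, assuming an in-memory policy table: `Token`, `issue_token`, `authorize`, `POLICY`, the SPIFFE-style ID, and the `orders-db` resource are all hypothetical and do not represent the SPIRE, OPA, or Cedar APIs. It combines a per-workload identity, a short-lived scoped token, and a deny-by-default policy lookup.

```python
import time
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class Token:
    workload_id: str         # per-workload identity (illustrative SPIFFE-style ID)
    scopes: FrozenSet[str]   # narrow scopes such as "orders-db:read"
    expires_at: float        # absolute expiry as a Unix timestamp


def issue_token(workload_id: str, scopes: Iterable[str],
                ttl_seconds: float = 300.0,
                now: Optional[float] = None) -> Token:
    """Issue a short-lived token; the 300-second default is an assumption."""
    issued = time.time() if now is None else now
    return Token(workload_id, frozenset(scopes), issued + ttl_seconds)


# Hypothetical policy table: (workload, resource) -> allowed actions.
# Any pair that is not listed is denied.
POLICY = {
    ("spiffe://example.org/ns/prod/sa/billing", "orders-db"): {"read"},
}


def authorize(token: Token, resource: str, action: str,
              now: Optional[float] = None) -> bool:
    """Decide at request time; every failed check results in a denial."""
    current = time.time() if now is None else now
    if current >= token.expires_at:
        return False  # an expired token cannot be reused
    if f"{resource}:{action}" not in token.scopes:
        return False  # the token was never scoped for this action
    allowed_actions = POLICY.get((token.workload_id, resource), set())
    return action in allowed_actions
```

The `now` parameter exists only so the expiry behaviour can be exercised deterministically. In a real deployment the identity would be verified cryptographically rather than trusted as a string, and the policy would typically live in a dedicated engine.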

This matters because the attacker does not need to “break” observability to move laterally. They only need a valid path that remains open long enough. NHIMG’s research on machine identity shows the scale problem clearly: 69% of organisations now have more machine identities than human ones, and 57% lack a complete inventory. That is the environment where static allowlists and long-lived secrets become easy to reuse across services, especially when identity ownership is unclear.

Best practice is evolving toward intent-aware authorisation, but there is no universal standard for this yet. Some teams use OPA or Cedar-style policies, others enforce mTLS and service mesh boundaries, and others rely on cloud-native conditional access. The common requirement is the same: access must be decided at runtime, not inferred later from logs. These controls tend to break down in flat networks with shared service accounts because a single credential can still authenticate broadly before any detector can intervene.

Common Variations and Edge Cases

Tighter prevention controls often increase operational overhead, requiring organisations to balance blast-radius reduction against deployment complexity. That tradeoff is real in mixed estates, where legacy services, batch jobs, and vendor integrations still depend on long-lived credentials or broad network reach.

In containerised or multi-cloud environments, observability can still be useful for spotting unusual sequences, but it should be treated as a signal source, not the enforcement plane. Where workloads are highly ephemeral, static role models fail quickly because the identity that exists at startup may not reflect what the workload is trying to do minutes later. This is especially true for agentic or automated systems that chain tools dynamically, but the same principle applies to traditional services that inherit privileges through shared infrastructure.

There is no universal standard for how much context an authorisation engine should consume. Current guidance suggests using enough context to make a defensible decision without turning every request into a manual review. That often means binding identity to workload, binding credentials to task duration, and revoking access immediately after completion. For broader background on the NHI risk model, the Ultimate Guide to NHIs and its standards overview are useful references.
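
Binding a credential to task duration can be sketched with a Python context manager. This is an assumption-laden illustration rather than any vendor's API: `CredentialBroker`, `task_credential`, and the example workload and scope names are hypothetical. The credential is tied to one workload and one scope, and it is revoked as soon as the task ends, including when the task fails.

```python
import secrets
from contextlib import contextmanager
from typing import Dict, Iterator, Tuple


class CredentialBroker:
    """Hypothetical in-memory broker that tracks which credentials are live."""

    def __init__(self) -> None:
        self._active: Dict[str, Tuple[str, str]] = {}

    def issue(self, workload_id: str, scope: str) -> str:
        value = secrets.token_urlsafe(16)
        self._active[value] = (workload_id, scope)  # bind to workload and scope
        return value

    def revoke(self, value: str) -> None:
        self._active.pop(value, None)

    def is_valid(self, value: str, workload_id: str, scope: str) -> bool:
        # Valid only for the exact workload and scope it was issued for.
        return self._active.get(value) == (workload_id, scope)


@contextmanager
def task_credential(broker: CredentialBroker, workload_id: str,
                    scope: str) -> Iterator[str]:
    """Issue a credential for one task and revoke it when the task ends."""
    value = broker.issue(workload_id, scope)
    try:
        yield value
    finally:
        broker.revoke(value)  # runs on success and on failure
```

A task would wrap its work in `with task_credential(broker, "batch-export", "reports-db:read") as cred:`. Once the block exits, the same value no longer validates, so a copy captured during the task cannot be replayed afterwards.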

The edge case most teams miss is not sophisticated malware, but an ordinary service account that can still reach too much because the secret is valid and the runtime never said no.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Non-Human Identity Top 10 and OWASP Agentic AI Top 10 address the attack and risk surface, while NIST AI RMF sets the governance and control requirements practitioners need to meet.

Framework | Control / Reference | Relevance
OWASP Non-Human Identity Top 10 | NHI-03 | Short-lived secrets reduce reuse after compromise.
OWASP Agentic AI Top 10 | A-04 | Runtime authorisation matters when workloads act dynamically.
NIST AI RMF | | AI risk management requires runtime controls for autonomous behaviour.

Rotate and scope workload secrets so a stolen credential cannot persist across internal hops.