Envoy DNS use-after-free shows the blast radius of dependencies

By NHI Mgmt Group Editorial Team
Published 2025-08-22
Domain: Breaches & Incidents
Source: Pomerium

TL;DR: A production crash was traced to a use-after-free bug in Envoy’s DNS resolver, c-ares, where a specific NXDOMAIN, search-domain retry, and connection-refused sequence could trigger a heap fault and remote denial of service, according to Pomerium. The case shows how dependency failures become application availability risks when identity-aware access relies on deep networking stacks.

At a glance

What this is: A post-mortem on a use-after-free bug in c-ares, Envoy’s DNS resolver, that could crash Pomerium through a specific DNS failure sequence.

Why it matters: Identity-aware gateways, proxies, and control planes inherit availability risk from deep dependencies that sit below the policy layer, which makes this case directly relevant to IAM and access architects.

By the numbers:

Pomerium said c-ares versions <= 1.34.5 were affected by the crash condition.
Fixes were published in Envoy 1.33.14, 1.34.12, 1.35.8, and 1.36.4.
2025-12-10: date cited by Pomerium for the step that followed reproducing the crash with ASan.

👉 Read Pomerium's analysis of the Envoy DNS use-after-free bug

Context

DNS reliability is a dependency problem, not just a networking one. When an identity-aware proxy depends on asynchronous resolver logic, a memory-safety flaw in that lower layer can take down the access path even though the policy engine itself is functioning correctly.

This Pomerium post is about how a narrow resolver bug in c-ares surfaced as a production crash inside Envoy and then in the proxy built on top of it. For IAM teams, the lesson is that control-plane resilience depends on the stability of every library in the access chain, not only on the identity policy layer.

Key questions

Q: What breaks when DNS resolver bugs affect an identity-aware proxy?

A: When a DNS resolver bug affects an identity-aware proxy, the failure is often broader than a single lookup error. The proxy can crash, policy enforcement stops, and all users or workloads depending on that path lose access. For IAM teams, that means resolver stability is part of access assurance, not a separate networking detail.

Q: Why do shared libraries create identity risk in access infrastructure?

A: Shared libraries create identity risk because they sit underneath policy enforcement and can fail without warning. If the proxy, gateway, or sidecar that enforces identity decisions depends on a buggy runtime component, the access path can disappear even when credentials and policies are correct. That is a control-plane availability problem, not just an application bug.

Q: How do teams know if their access path is resilient enough?

A: Teams know the access path is resilient enough when they can replay malformed inputs, DNS failures, and retry storms without taking down the enforcement layer. The useful signal is not whether normal traffic works, but whether the system remains stable under the exact failure patterns that proxy dependencies create.

Q: Who should own failures in embedded access dependencies?

A: Ownership should sit with the team that ships the access path, even when the bug lives in a third-party dependency. If the proxy or control plane fails, the business impact is local to that service, so the owner must track patching, test coverage, and recovery expectations for the full dependency chain.

Technical breakdown

How a DNS retry sequence can trigger a use-after-free

c-ares performs asynchronous DNS lookups, including search-domain retries when an initial query returns NXDOMAIN. In the reported case, a connection error during the retry path caused the resolver to tear down a server connection while another callback still expected that object to exist. That is the classic shape of a use-after-free: memory is released, then later accessed by code that still holds a stale reference. The bug lived in an edge-case interaction between response handling and cleanup logic, which is why ASan was needed to surface it quickly.

Practical implication: Instrument resolver paths with memory-safety tooling in pre-production so cleanup callbacks cannot outlive the objects they reference.
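
The lifetime hazard described above can be modeled in a few lines. This is an illustrative Python sketch, not c-ares code: Python is memory-safe, so a `closed` flag stands in for freed memory, and the names `ServerConnection`, `Resolver`, and `StaleReferenceError` are hypothetical. It shows a queued callback that still holds a reference after an error path tears the connection down, plus one common defensive pattern: a liveness check before use.

```python
class StaleReferenceError(RuntimeError):
    """Raised when code touches a connection after teardown (stand-in for a UAF)."""


class ServerConnection:
    def __init__(self, address):
        self.address = address
        self.closed = False

    def close(self):
        # In C this would free the object; here we only mark it dead.
        self.closed = True

    def read(self):
        if self.closed:
            raise StaleReferenceError(f"read on closed connection {self.address}")
        return b"response"


class Resolver:
    def __init__(self):
        self.pending = []  # callbacks queued for later execution

    def schedule_read(self, conn, guarded):
        def callback():
            if guarded and conn.closed:
                return None  # connection is gone: skip instead of touching it
            return conn.read()

        self.pending.append(callback)

    def on_connection_error(self, conn):
        # The error path tears the connection down while callbacks are still queued.
        conn.close()

    def run_pending(self):
        results = [cb() for cb in self.pending]
        self.pending.clear()
        return results


def simulate(guarded):
    """Queue a read, hit a connection error, then run the queued callback."""
    resolver = Resolver()
    conn = ServerConnection("10.0.0.53")  # arbitrary example address
    resolver.schedule_read(conn, guarded)
    resolver.on_connection_error(conn)
    try:
        return resolver.run_pending()
    except StaleReferenceError:
        return "stale access"
```

Calling `simulate(False)` reaches the stale object, while `simulate(True)` skips the dead connection. In C there is no exception to catch, which is why tools such as ASan are used to make the stale access visible.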

Why Envoy and Pomerium inherit resolver failure risk

Envoy uses c-ares for DNS resolution, and Pomerium embeds Envoy as its data plane. That means a resolver defect can propagate upward as a proxy crash even when the application logic is not directly at fault. This is a dependency-chain issue: every layer inherits the operational assumptions of the layer beneath it. In access infrastructure, the blast radius is often larger than the bug itself because the control plane is part of the service path.

Practical implication: Map critical dependencies from application to library level and treat shared resolvers as production-grade components, not opaque implementation details.
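
One way to make that mapping concrete is to record direct dependencies and compute which components sit above a given library. The sketch below is a minimal example under stated assumptions: the `pomerium -> envoy -> c-ares` chain comes from this post, while `internal-dashboard`, `batch-worker`, and `metrics-agent` are hypothetical entries added for illustration.

```python
from collections import deque

# Hypothetical inventory: each key depends directly on the listed components.
DEPENDS_ON = {
    "pomerium": ["envoy"],
    "envoy": ["c-ares"],
    "internal-dashboard": ["pomerium"],  # hypothetical service behind the proxy
    "batch-worker": ["pomerium"],        # hypothetical workload behind the proxy
    "metrics-agent": [],                 # hypothetical, unrelated to the chain
}


def blast_radius(component, depends_on):
    """Return every component that transitively depends on `component`."""
    # Invert the edges: component -> things that depend on it directly.
    dependents = {}
    for name, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(name)

    # Breadth-first walk upward from the faulty component.
    affected = set()
    queue = deque([component])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, ()):
            if parent not in affected:
                affected.add(parent)
                queue.append(parent)
    return affected
```

With this inventory, `blast_radius("c-ares", DEPENDS_ON)` returns the proxy layers and both hypothetical services behind them, which is the list of owners who need to know about a resolver advisory.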

Why timing and environment made the bug hard to catch

The crash only appeared under a specific Kubernetes setup using NodeLocal DNSCache and very high query load. That points to a concurrency and timing boundary, not a deterministic failure in ordinary operation. Bugs like this often evade normal test coverage because they require a precise response pattern plus transient network behavior. When the failure mode is environment-specific, operational telemetry and replayable tests become the only practical route to root cause analysis.

Practical implication: Reproduce resolver faults under realistic load and clustered DNS conditions, not just with unit tests in isolated lab environments.
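
A replayable test needs a DNS responder whose answers are fully controlled. The sketch below covers only one building block, crafting an NXDOMAIN reply for a given query with the standard library; it does not reproduce the full trigger sequence (search-domain retry plus connection refused) reported by Pomerium. The helper names are hypothetical, and the parser assumes a single, uncompressed question.

```python
import struct

RCODE_NXDOMAIN = 3


def question_end(packet, offset=12):
    """Return the offset just past the first question in a DNS packet."""
    while True:
        length = packet[offset]
        offset += 1
        if length == 0:
            break
        offset += length
    return offset + 4  # QTYPE and QCLASS, two bytes each


def build_nxdomain_response(query):
    """Build an NXDOMAIN reply that echoes the query's ID and question."""
    query_id, query_flags, qdcount = struct.unpack("!HHH", query[:6])
    if qdcount != 1:
        raise ValueError("sketch handles exactly one question")
    question = query[12:question_end(query)]
    # QR=1 (response), keep opcode and RD from the query, RA=1, RCODE=3.
    flags = 0x8000 | (query_flags & 0x7900) | 0x0080 | RCODE_NXDOMAIN
    header = struct.pack("!HHHHHH", query_id, flags, 1, 0, 0, 0)
    return header + question


def make_query(name, query_id=0x1234):
    """Build a minimal A-record query for `name` with recursion desired."""
    labels = b"".join(
        bytes([len(part)]) + part.encode("ascii") for part in name.split(".")
    )
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    return header + labels + b"\x00" + struct.pack("!HH", 1, 1)
```

In a test harness, a UDP socket could return `build_nxdomain_response(data)` for each received datagram, with a closed TCP port alongside it to approximate refused connections. Running the proxy under ASan against such a responder, at load, is closer to the reported conditions than an isolated unit test.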

Threat narrative

Attacker objective: The objective is remote denial of service by forcing the access proxy or any c-ares-backed service to crash.

Entry occurs through a specific DNS response sequence that includes NXDOMAIN and a retry path in the resolver.
Escalation occurs when a connection error destroys a server object while cleanup logic still holds a live reference to it.
Impact is a heap use-after-free that crashes Envoy and takes the identity-aware access path down with it.

NHI Mgmt Group analysis

Dependency resilience is part of identity governance, not a separate reliability concern. Identity-aware gateways and access proxies sit in the control path for human, NHI, and service traffic, so a crash in a shared library becomes an access outage even when IAM policy is unchanged. That makes resolver stability, runtime observability, and dependency hygiene part of the governance surface. Practitioners should treat the access path as a governed system, not just a policy decision.

Blast radius, not policy correctness, is the real control variable in identity-aware infrastructure. The Pomerium case shows that a narrow defect in c-ares can disable the proxy that enforces access decisions. This is why NIST CSF resilience thinking matters alongside access control: if the enforcement layer fails, authorization intent never reaches users or workloads. Identity teams should evaluate how much of their access stack depends on shared libraries they do not directly operate.

Deep dependency bugs expose an identity operations gap that most teams under-model. Access reviews and privilege governance often focus on who can reach a system, not on whether the system’s enforcement path can survive malformed inputs or transient network states. That gap matters for NHI-heavy environments where proxies, DNS, and sidecars are always on the critical path. Practitioners should fold dependency fault tolerance into IAM and PAM risk reviews.

Memory-safety failures in infrastructure libraries are governance failures when they interrupt trusted access flows. The specific issue here was not credential abuse, but the operational consequence is similar: users lose access because the enforcement plane becomes unstable. That is a reminder that identity programmes need a broader failure model than policy misconfiguration alone. The implication is that availability and control enforcement are inseparable for modern access architectures.

Runtime debugging evidence is part of modern access assurance. The authors’ use of ASan and a reproducible test demonstrates how deep faults become governable only when teams can instrument the full path from request to cleanup. For identity architects, that means testability is not a developer luxury; it is part of operational assurance for access infrastructure. Teams should be able to prove the access plane behaves safely under abnormal network conditions.

From our research:
97% of NHIs carry excessive privileges, increasing unauthorised access and broadening the attack surface, according to Ultimate Guide to NHIs.
71% of NHIs are not rotated within recommended time frames, which means control drift often persists long after the initial deployment decision.
For deeper lifecycle context, review the Ultimate Guide to NHIs for visibility, rotation, and offboarding patterns that reduce dependency blast radius.

What this signals

Resolver resilience is becoming part of identity programme maturity. As more access decisions flow through proxies, gateways, and sidecars, teams need to assess whether a low-level library fault can interrupt authentication and authorisation paths. The programme signal is simple: if the access layer cannot survive abnormal DNS behavior, the identity control model is only partially enforceable.

A useful concept here is dependency blast radius: the distance between a library fault and the business service that loses access. Practitioners should shorten that radius by knowing exactly which runtime components sit between a request and a policy decision, then validating their failure behavior under load.

For governance teams, this kind of incident reinforces the need to align access controls with NIST Cybersecurity Framework 2.0 resilience thinking. The practical shift is to test enforcement continuity, not just entitlement correctness, whenever a shared network component is in the access chain.

For practitioners

Inventory deep access dependencies: Document every library and resolver used by identity-aware proxies, sidecars, and policy enforcement components. Include version bounds, owner, and whether the component can crash the access path if it faults.
Test resolver failure paths under load: Run replayable chaos and memory-safety tests against DNS retry, timeout, and connection-refused scenarios in clustered environments. Include NodeLocal DNSCache or equivalent local resolution layers where they exist.
Treat shared runtime libraries as patch-priority assets: Track c-ares, Envoy, and similar low-level dependencies with the same patch urgency you use for exposed access services. A bug below the policy layer can still become an identity outage.
Add control-plane resilience to IAM risk reviews: Review whether your access stack can still enforce policy when DNS, proxy callbacks, or embedded networking libraries fail. If not, record the failure mode as an operational identity risk.
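
The version bounds in an inventory can be checked mechanically. The following is a minimal sketch that encodes only the figures cited in this post (c-ares versions up to 1.34.5 affected; Envoy fixes in 1.33.14, 1.34.12, 1.35.8, and 1.36.4). The function names are hypothetical, versions are assumed to be plain three-part `X.Y.Z` strings, and Envoy release lines not listed here are reported as unknown rather than guessed.

```python
def parse_version(text):
    """Parse 'X.Y.Z' (optionally prefixed with 'v') into a tuple of ints."""
    return tuple(int(part) for part in text.strip().lstrip("v").split("."))


# Figures as cited from Pomerium in this post.
CARES_LAST_AFFECTED = (1, 34, 5)
ENVOY_FIXED_PATCH = {(1, 33): 14, (1, 34): 12, (1, 35): 8, (1, 36): 4}


def cares_is_affected(version):
    """True if the c-ares version is at or below the last affected release."""
    return parse_version(version) <= CARES_LAST_AFFECTED


def envoy_has_fix(version):
    """True/False for listed release lines, None when the line is not listed."""
    major, minor, patch = parse_version(version)
    fixed_patch = ENVOY_FIXED_PATCH.get((major, minor))
    if fixed_patch is None:
        return None  # not covered by the cited fix list: verify upstream
    return patch >= fixed_patch
```

A `None` result is deliberate: it flags a component whose status has to be confirmed against upstream release notes instead of being assumed safe.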

Key takeaways

A DNS resolver use-after-free can become an identity availability incident when access enforcement depends on the affected runtime.
The evidence here was a reproducible crash under a specific retry and DNS cache condition, not a generic application failure.
Teams should govern the full access path, including shared libraries and resolver behavior, because policy correctness does not matter if enforcement cannot stay up.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

The OWASP Non-Human Identity Top 10 addresses the attack and risk surface, while NIST CSF 2.0 and NIST Zero Trust (SP 800-207) set the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
NIST CSF 2.0		Resilience and recovery apply because resolver faults can take down the access plane.
NIST Zero Trust (SP 800-207)	PR.AC	Identity-aware proxies enforce access decisions in the trust path and must remain available.
OWASP Non-Human Identity Top 10	NHI-03	The incident shows how non-human infrastructure dependencies can create broad operational exposure.

Map access-path failure modes to CSF resilience outcomes and test enforcement continuity under abnormal DNS behavior.

Key terms

Use-after-free: A use-after-free happens when software continues to read or write memory after that memory has already been released. In access infrastructure, the result can be a crash or unpredictable behavior in components that sit on the enforcement path, even if the higher-level policy logic is correct.
Dependency blast radius: Dependency blast radius is the amount of service impact created when a lower-level component fails. In identity-aware systems, it describes how far a resolver, proxy library, or sidecar defect can propagate before it disrupts authentication, authorisation, or session continuity.
Identity-aware proxy: An identity-aware proxy is an enforcement layer that decides whether a request should reach an application based on identity and policy context. It sits in the path between user or workload traffic and the target service, so its own stability becomes part of the access control model.
Asynchronous DNS resolution: Asynchronous DNS resolution lets software request name lookups without blocking the main execution flow. That design improves performance, but it also introduces callback and cleanup complexity, which can create hard-to-reproduce faults when retries, timeouts, or connection errors overlap.

Deepen your knowledge

NHI governance, agentic AI identity, and machine identity lifecycle are core topics in our NHI Foundation Level course, the industry's only accredited NHI security programme. If you are responsible for identity security strategy or NHI governance in your organisation, it is worth exploring.

This post draws on content published by Pomerium: It's always DNS part ∞, tracking down a use-after-free bug in Envoy's DNS resolver c-ares. Read the original.

NHIMG Editorial Note
Published by the NHIMG editorial team on 2025-08-22.
NHI Mgmt Group — the independent authority on Non-Human Identity, IAM, and Agentic AI security. nhimg.org