Trivy etcd exhaustion shows the cost of security visibility at scale

By NHI Mgmt Group Editorial Team | Published 2026-06-30 | Domain: Workload Identity | Source: Aqua Security

TL;DR: Kubernetes etcd can be pushed into a read-only failure state when Trivy Operator vulnerability reporting leaves report volume, object size, and MVCC history unchecked, according to Aqua Security. The lesson is that visibility tooling must be governed as infrastructure pressure, not just security telemetry.

At a glance

What this is: Aqua Security's analysis of how Trivy Operator vulnerability reporting can exhaust Kubernetes etcd and block control-plane writes.

Why it matters: IAM and platform teams still treat security telemetry as low-risk metadata, when in practice report volume, storage model, and retention choices can break availability.

By the numbers:

Systems with least-privileged AI access had a 17% incident rate vs 76% for over-privileged systems, and organisations failing to scope AI access properly are 4.5x more likely to experience a security incident.
Only 44% of organisations have implemented any policies to manage their AI agents, despite 92% agreeing that governing AI agents is critical to enterprise security.

👉 Read Aqua Security's analysis of Trivy operator etcd exhaustion

Context

Kubernetes vulnerability reporting can become an availability problem when the reporting system writes too much data into etcd. Trivy Operator is designed to improve visibility, but its reports are stored as Kubernetes objects, which means large payloads, high churn, and long retention can turn telemetry into control-plane pressure.

For IAM and security architects, the issue is not whether vulnerability scanning is useful. The issue is whether the identity and workload security programme treats telemetry storage, retention, and object growth as part of the operating model rather than as an afterthought.

At scale, the same controls that improve security insight can destabilise the cluster if they are not bounded by storage limits, TTL policies, and data placement choices. That makes this an infrastructure governance problem as much as a scanner configuration problem.

Key questions

Q: What breaks when vulnerability scan data is stored directly in etcd at scale?

A: etcd can run out of space, which pushes the Kubernetes API server into read-only mode and blocks ordinary writes. That means deployments, scaling actions, and deletions can fail even though the underlying cause is just security telemetry growth. The break point is not visibility itself, but uncontrolled report volume and revision churn.

Q: When should teams move vulnerability reports out of the Kubernetes control plane?

A: Teams should move reports out of etcd when scan frequency, object size, or cluster churn makes control-plane storage part of the availability risk. If visibility only works by consuming scarce API-server state, the reporting model has crossed from useful into brittle. Alternate storage becomes the safer boundary at that point.

Q: How do you know if vulnerability reporting is creating hidden etcd pressure?

A: Watch for rising database size, slower write paths, and repeated growth after deletions, because MVCC keeps historical versions until compaction and defragmentation clean them up. If your alerting shows storage trending toward quota even when object counts look stable, the reporting layer is still accumulating internal history.
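The signals described above can be sketched as a small check over etcd's size readings. This is a minimal illustration, not a monitoring implementation: the metric names in the comments are believed to match etcd's Prometheus metrics but should be verified against your etcd version, and the 80% and 50% thresholds are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class EtcdSizeSample:
    """One reading of etcd's size metrics, all in bytes."""
    db_total_bytes: int    # on-disk size, e.g. etcd_mvcc_db_total_size_in_bytes
    db_in_use_bytes: int   # logically used, e.g. etcd_mvcc_db_total_size_in_use_in_bytes
    quota_bytes: int       # backend quota, e.g. etcd_server_quota_backend_bytes


def quota_usage(sample: EtcdSizeSample) -> float:
    """Fraction of the backend quota consumed by the on-disk database."""
    return sample.db_total_bytes / sample.quota_bytes


def reclaimable_fraction(sample: EtcdSizeSample) -> float:
    """Fraction of the on-disk file that is no longer logically in use."""
    if sample.db_total_bytes == 0:
        return 0.0
    return 1 - sample.db_in_use_bytes / sample.db_total_bytes


def pressure_signals(sample: EtcdSizeSample,
                     warn_quota: float = 0.8,         # assumed threshold
                     warn_reclaimable: float = 0.5):  # assumed threshold
    """Return labels for the pressure conditions the sample exhibits."""
    signals = []
    if quota_usage(sample) >= warn_quota:
        signals.append("near-quota")
    if reclaimable_fraction(sample) >= warn_reclaimable:
        signals.append("defrag-candidate")
    return signals
```

A sample that is near quota while most of the file is reclaimable matches the pattern in the answer above: object counts look stable, yet the database file keeps growing.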

Q: Who is accountable when security tooling contributes to a Kubernetes outage?

A: Accountability sits with both the platform team and the security team, because the failure is architectural. The control plane owner must govern storage limits and maintenance, while the security owner must bound report volume and data placement. The issue is shared design responsibility, not a single misconfigured scanner.

Technical breakdown

Why VulnerabilityReport CRDs can overwhelm etcd

The Trivy Operator stores scan output as VulnerabilityReport custom resources, so every report becomes part of the Kubernetes control plane state. That is convenient for kubectl visibility, but it also means large findings, optional metadata, and many scanned images increase the size of what etcd must hold and replicate. Kubernetes API limits exist to protect the cluster, not to be tuned away. When a report crosses those limits, the write path fails before the security team gets the visibility it wanted.

Practical implication: cap report size and field verbosity before vulnerability data is written into the control plane.
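One way to reason about this is a pre-flight size check on a report before it is written. The sketch below is illustrative only: the report is a simplified dictionary rather than the real VulnerabilityReport schema, the optional field names are hypothetical, and the 1.5 MiB limit is an assumption that should be confirmed for your cluster.

```python
import json

# Assumption: roughly 1.5 MiB per object; confirm the real limit for your cluster.
ASSUMED_MAX_OBJECT_BYTES = 1_572_864

# Hypothetical names for optional, verbose fields on each finding.
OPTIONAL_FIELDS = ("links", "description", "cvss")


def serialized_size(report: dict) -> int:
    """Size in bytes of the report when serialized as compact JSON."""
    return len(json.dumps(report, separators=(",", ":")).encode("utf-8"))


def trim_findings(report: dict, drop=OPTIONAL_FIELDS) -> dict:
    """Return a copy of the report with optional fields removed from each finding."""
    trimmed = dict(report)
    trimmed["vulnerabilities"] = [
        {k: v for k, v in finding.items() if k not in drop}
        for finding in report.get("vulnerabilities", [])
    ]
    return trimmed


def fits(report: dict, limit: int = ASSUMED_MAX_OBJECT_BYTES,
         headroom: float = 0.8) -> bool:
    """True if the serialized report stays within a safety margin of the limit."""
    return serialized_size(report) <= limit * headroom
```

Checking `fits(report)` and then `fits(trim_findings(report))` shows how much of an oversized object is verbose metadata rather than actionable findings.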

How MVCC turns scan churn into database growth

etcd uses Multi-Version Concurrency Control, which means updates do not overwrite data in place. Each report refresh creates a new revision, and deletes create more history, so frequent rescans produce a growing chain of live and historical versions. Compaction removes old revisions from the database view, but defragmentation is what returns physical disk space to the operating system. Without both, a small object set can still consume a large quota over time.

Practical implication: pair TTL controls with scheduled compaction and defragmentation, not one or the other.
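The difference between compaction and defragmentation is easier to see in a toy model. The class below is a deliberately simplified sketch, not etcd's implementation: it counts revisions as "slots", compacts everything up to the current revision in one step, and treats defragmentation as shrinking the file to what is still stored.

```python
class ToyMVCCStore:
    """Toy model of revisioned storage; not etcd's real implementation."""

    def __init__(self):
        self.revision = 0
        self.history = {}     # key -> list of (revision, value); None marks a delete
        self.file_slots = 0   # slots allocated in the backing file

    def _append(self, key, value):
        self.revision += 1
        self.history.setdefault(key, []).append((self.revision, value))
        # The file grows only when no previously freed slot can be reused.
        if self.stored_revisions() > self.file_slots:
            self.file_slots = self.stored_revisions()

    def put(self, key, value):
        """Write a new revision; earlier revisions of the key are kept."""
        self._append(key, value)

    def delete(self, key):
        """Record a tombstone revision instead of erasing history."""
        self._append(key, None)

    def stored_revisions(self) -> int:
        return sum(len(revs) for revs in self.history.values())

    def compact(self):
        """Drop superseded revisions and tombstoned keys; the file does not shrink."""
        for key in list(self.history):
            latest = self.history[key][-1]
            if latest[1] is None:
                del self.history[key]
            else:
                self.history[key] = [latest]

    def defragment(self):
        """Shrink the backing file to what is still stored."""
        self.file_slots = self.stored_revisions()
```

Refreshing one report ten times and deleting another leaves twelve revisions for a single live object. After `compact()` only one revision remains, yet `file_slots` stays at twelve until `defragment()` runs, which is why the two steps have to be paired.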

Why alternate report storage and severity filtering change the operating model

The operator’s alternate storage mode moves reports out of etcd and onto a persistent volume, which avoids the API server object limit and reduces control-plane churn. Severity filtering does something different: it shrinks each report by suppressing low-value findings before they are written. These are architectural choices, not cosmetic ones. One changes where data lives, the other changes how much data exists. Both trade some convenience for a more stable platform.

Practical implication: decide whether your programme needs full object-level visibility or a safer reporting boundary, then configure accordingly.
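The effect of severity filtering on a single report can be sketched as follows. This is an illustration of the idea only: the finding structure is hypothetical, and in practice the filter is applied through the operator's own configuration, which the source article covers.

```python
def filter_by_severity(findings, allowed=("HIGH", "CRITICAL")):
    """Keep only findings whose severity is in the allowed set (case-insensitive)."""
    allowed_set = {level.upper() for level in allowed}
    return [
        finding for finding in findings
        if str(finding.get("severity", "UNKNOWN")).upper() in allowed_set
    ]


def reduction_ratio(findings, allowed=("HIGH", "CRITICAL")) -> float:
    """Fraction of findings removed by the filter (0.0 when there are none)."""
    if not findings:
        return 0.0
    return 1 - len(filter_by_severity(findings, allowed)) / len(findings)
```

Running `reduction_ratio` over a representative sample of findings gives a rough sense of how much report data a severity threshold would keep out of the control plane.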

NHI Mgmt Group analysis

Security telemetry is part of the control plane, not a separate reporting layer. When vulnerability data is written into etcd, every scan decision becomes a platform availability decision. That shifts the governance question from how much visibility teams want to how much control-plane state they are willing to carry. Practitioners should treat reporting volume, storage placement, and retention as identity-adjacent infrastructure controls, not scanner preferences.

Storage bloat is the failure mode, but the governance mistake is assuming visibility is free. The article shows that optional fields, repeated scans, and long-lived objects create a hidden cost inside the cluster’s brain. This is not a tooling defect alone. It is a programme design issue where security teams optimise for completeness without accounting for the operational footprint of the data they create. The implication is that reporting architecture must be reviewed with the same seriousness as the scan policy itself.

High-churn security data creates an identity blast radius for the platform. Once scan output is expressed as Kubernetes objects, the blast radius extends from vulnerability management into API server availability, pod lifecycle actions, and operational recovery. That pattern matters beyond Trivy because many machine-identity and workload-security systems also persist state in shared control planes. Practitioners should assume that any security control that writes continuously can become an availability dependency.

The right mental model is bounded visibility, not maximum retention. The article makes clear that compaction, defragmentation, alternate storage, and severity filtering are not interchangeable cleanup steps. They are different answers to different failure points in the write path. That distinction matters for platform governance because a team can be technically observant while still architecturally brittle. The implication is to define how much security state belongs in etcd before production scale forces the decision.

For Kubernetes security programmes, scanner governance is now an availability control. The old assumption that security telemetry is passive no longer holds when telemetry is stateful, frequent, and control-plane resident. This is the kind of failure mode that demands shared ownership between platform engineering and security architecture. Practitioners should govern scan output with the same change-control discipline they apply to workload deployment.

From our research:
Only 1.5 out of 10 organisations are highly confident in their ability to secure NHIs, compared to nearly 1 in 4 for securing human identities, according to The State of Non-Human Identity Security.
A separate survey found that 85% of organisations lack full visibility into third-party vendors connected via OAuth apps, showing how quickly identity sprawl becomes governance blind spots.
That same research also showed 1 in 4 organisations are already investing in dedicated NHI security capabilities, a sign that operational ownership is shifting from theory to programme design.

What this signals

The governance signal here is that security telemetry must be treated as stateful infrastructure. When a scanner writes detailed findings into the same control plane that runs the cluster, the reporting layer can become the outage layer. Practitioners should review where their security objects live before object volume becomes an availability event.

Control-plane telemetry debt: security data accumulates as history, not just as current state, and that creates hidden cost in Kubernetes environments. The more often teams refresh reports, the more they must account for MVCC growth, retention, and compaction as part of the operating model.

With 70% of organisations already granting AI systems more access than human employees in other environments, according to the 2026 Infrastructure Identity Survey, the broader signal is that machine-driven systems are producing more state, more frequently, across more shared platforms. That makes bounded state management a first-class identity governance concern.

For practitioners

Bound report size before it reaches etcd: Disable optional fields such as links, CVSS detail, and long descriptions where they materially increase VulnerabilityReport object size. Validate the final object size against the Kubernetes API server limit before rollout, especially in clusters with many high-vulnerability images.
Set a hard TTL for scan output: Use a report retention policy such as a 24-hour TTL so stale vulnerability objects are deleted before they accumulate into persistent MVCC churn. Revisit the TTL whenever scan frequency or cluster size changes.
Pair compaction with defragmentation: Schedule etcd compaction and defragmentation as an operational runbook, not a one-time rescue action. Compaction reduces logical history, while defragmentation returns freed disk space to the host.
Move report storage out of etcd where scale demands it: If object-level visibility is not essential, enable alternate report storage on a persistent volume and ensure a default StorageClass exists to support the claim. This changes the storage boundary without reducing the scan itself.
Filter to the severities you will actually act on: Restrict report generation to HIGH and CRITICAL issues where the organisation cannot operationally respond to lower-severity findings. This reduces object churn while keeping the reporting stream aligned to actionability.
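The TTL recommendation above reduces to a simple age check, sketched here for illustration. The mapping of report names to creation times is a hypothetical input; in a real cluster the operator applies the TTL itself, and this only shows the rule being enforced.

```python
from datetime import datetime, timedelta, timezone


def expired_reports(created_at, ttl=timedelta(hours=24), now=None):
    """Return the names of reports whose age is at least the TTL.

    created_at maps a report name to its timezone-aware creation time.
    """
    now = now or datetime.now(timezone.utc)
    return sorted(name for name, created in created_at.items()
                  if now - created >= ttl)
```

Passing a fixed `now` makes the check deterministic, which helps when comparing what a 24-hour TTL would delete against a shorter or longer window.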

Key takeaways

Trivy-style visibility becomes a control-plane risk when vulnerability reports are persisted as Kubernetes objects without tight bounds.
etcd exhaustion is not just a storage problem, because MVCC revision churn can keep the database growing even after objects are deleted.
The practical fix is architectural, combining TTL, compaction, defragmentation, alternate storage, and severity filtering based on what the cluster can safely absorb.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

The OWASP Non-Human Identity Top 10 addresses the attack and risk surface, while NIST CSF 2.0 and NIST Zero Trust (SP 800-207) set the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
NIST CSF 2.0	PR.PT-5	Security telemetry should not destabilise protected systems.
OWASP Non-Human Identity Top 10	NHI-03	Persistent security data needs lifecycle limits and deletion discipline.
NIST Zero Trust (SP 800-207)	SC-2	Control-plane writes must remain bounded and continuously monitored.

Limit scan-output state in shared control planes and monitor for storage pressure as an availability risk.

Key terms

VulnerabilityReport CRD: A VulnerabilityReport CRD is a Kubernetes custom resource used to store scan findings as first-class objects. In practice, that makes security telemetry part of cluster state, so report size, count, and update frequency can affect control-plane health as much as the findings themselves.
Etcd MVCC: etcd MVCC is the database model that keeps historical revisions instead of overwriting records in place. That design supports consistency, but it also means repeated updates and deletions leave behind version history until compaction and defragmentation reclaim space.
Control-plane pressure: Control-plane pressure is the operational strain created when too many writes, large objects, or frequent updates burden Kubernetes control components. For identity and security teams, it is the point where visibility tooling stops being passive and starts competing with workload operations.
Alternate report storage: Alternate report storage moves vulnerability output away from etcd and into a separate persistent volume. This reduces API-server object growth and protects the control plane, but it also changes how operators retrieve and manage scan data.

What's in the full article

Aqua Security's full blog post covers the operational detail this post intentionally leaves for the source:

Step-by-step Trivy Operator configuration guidance for reducing etcd pressure in Kubernetes.
Exact settings for alternate report storage, including the PVC and StorageClass requirements.
A patch example for filtering vulnerability severities at the source.
Operational notes on compaction, defragmentation, and alerting thresholds for etcd growth.

👉 The full Aqua Security post covers Trivy configuration, storage offload, and etcd maintenance details.

NHIMG Editorial Note
Published by the NHIMG editorial team on 2026-06-30.
NHI Mgmt Group — the independent authority on Non-Human Identity, IAM, and Agentic AI security. nhimg.org