TL;DR: Kubernetes etcd can be pushed into a read-only failure state when Trivy Operator vulnerability reporting leaves report volume, object size, and MVCC history unchecked, according to Aqua Security. The lesson is that visibility tooling must be governed as infrastructure pressure, not just security telemetry.
NHIMG editorial — based on content published by Aqua Security: When Security Scans Break the Brain: Solving Trivy’s etcd Exhaustion Problem
By the numbers:
- Systems with least-privileged AI access had a 17% incident rate vs 76% for over-privileged systems, and organisations failing to scope AI access properly are 4.5x more likely to experience a security incident.
- Only 44% of organisations have implemented any policies to manage their AI agents, despite 92% agreeing that governing AI agents is critical to enterprise security.
Questions worth separating out
Q: What breaks when vulnerability scan data is stored directly in etcd at scale?
A: The etcd database can hit its storage quota. etcd then raises a space alarm and accepts only reads and deletes, so the Kubernetes API server is effectively read-only and ordinary writes fail.
Q: When should teams move vulnerability reports out of the Kubernetes control plane?
A: Teams should move reports out of etcd when scan frequency, object size, or cluster churn makes control-plane storage part of the availability risk.
Q: How do you know if vulnerability reporting is creating hidden etcd pressure?
A: Watch for rising database size, slower writes, and a database that keeps growing even after reports are deleted. etcd's MVCC store retains historical revisions until compaction discards them, and the freed space is returned to the filesystem only after defragmentation.
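The signals above can be checked against etcd's own metrics. The following is a minimal sketch, assuming you can scrape the etcd `/metrics` endpoint as Prometheus text and that it exposes `etcd_mvcc_db_total_size_in_bytes`, `etcd_mvcc_db_total_size_in_use_in_bytes`, and `etcd_server_quota_backend_bytes`. The helper names and the 80% warning threshold are illustrative choices, not values taken from the Aqua Security article.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse unlabelled samples from Prometheus text exposition output."""
    samples: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, HELP/TYPE comments, and labelled series such as histograms.
        if not line or line.startswith("#") or "{" in line:
            continue
        parts = line.split()
        if len(parts) >= 2:
            samples[parts[0]] = float(parts[1])
    return samples


def etcd_pressure(samples: dict[str, float], quota_warn: float = 0.8) -> dict:
    """Summarise how close the database is to its quota (threshold is an assumption)."""
    total = samples["etcd_mvcc_db_total_size_in_bytes"]          # physically allocated
    in_use = samples["etcd_mvcc_db_total_size_in_use_in_bytes"]  # logically in use
    quota = samples["etcd_server_quota_backend_bytes"]
    return {
        "quota_used": total / quota,
        # Space that defragmentation could hand back to the filesystem.
        "reclaimable_bytes": total - in_use,
        "near_quota": total / quota >= quota_warn,
    }
```

A large gap between allocated and in-use size suggests defragmentation is overdue, while an in-use size that keeps climbing points at report volume or missing compaction.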
Practitioner guidance
- Bound report size before it reaches etcd: Disable optional fields such as links, CVSS detail, and long descriptions where they materially increase VulnerabilityReport object size.
- Set a hard TTL for scan output: Use a report retention policy such as a 24-hour TTL so stale vulnerability objects are deleted before they accumulate into persistent MVCC churn.
- Pair compaction with defragmentation: Schedule etcd compaction and defragmentation as an operational runbook, not a one-time rescue action.
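The first two points can be illustrated with a small sketch. It is not Trivy Operator code, and the field names (`links`, `description`, `cvss`) are assumptions chosen for illustration rather than the exact VulnerabilityReport schema. In practice the operator's own configuration controls these settings, as the source article describes. The sketch only shows why dropping optional fields shrinks each stored object and how a 24-hour TTL check behaves.

```python
import json
from datetime import datetime, timedelta

# Hypothetical names for the optional, size-heavy fields of one finding.
OPTIONAL_FIELDS = ("links", "description", "cvss")


def trim_finding(finding: dict) -> dict:
    """Return a copy of the finding without the optional fields."""
    return {k: v for k, v in finding.items() if k not in OPTIONAL_FIELDS}


def encoded_size(obj) -> int:
    """Approximate stored size as compact JSON bytes (etcd's encoding may differ)."""
    return len(json.dumps(obj, separators=(",", ":")).encode("utf-8"))


def is_expired(created_at: datetime, now: datetime,
               ttl: timedelta = timedelta(hours=24)) -> bool:
    """True once a report is at least `ttl` old and should be deleted."""
    return now - created_at >= ttl
```

Comparing `encoded_size(finding)` with `encoded_size(trim_finding(finding))` across a sample of real reports gives a rough estimate of how much each optional field costs before changing any operator setting.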
What's in the full article
Aqua Security's full blog post covers the operational detail this post intentionally leaves for the source:
- Step-by-step Trivy Operator configuration guidance for reducing etcd pressure in Kubernetes.
- Exact settings for alternate report storage, including the PVC and StorageClass requirements.
- A patch example for filtering vulnerability severities at the source.
- Operational notes on compaction, defragmentation, and alerting thresholds for etcd growth.
👉 Read Aqua Security's analysis of Trivy operator etcd exhaustion →
Security telemetry is part of the control plane, not a separate reporting layer. When vulnerability data is written into etcd, every scan decision becomes a platform availability decision. That shifts the governance question from how much visibility teams want to how much control-plane state they are willing to carry. Practitioners should treat reporting volume, storage placement, and retention as identity-adjacent infrastructure controls, not scanner preferences.
A few things that frame the scale:
- Only 15% of organisations (1.5 in 10) are highly confident in their ability to secure NHIs, compared to nearly 1 in 4 for securing human identities, according to The State of Non-Human Identity Security.
- A separate survey found that 85% of organisations lack full visibility into third-party vendors connected via OAuth apps, showing how quickly identity sprawl becomes governance blind spots.
A question worth separating out:
Q: Who is accountable when security tooling contributes to a Kubernetes outage?
A: Accountability sits with both the platform team and the security team, because the failure is architectural. The control plane owner must govern storage limits and maintenance, while the security owner must bound report volume and data placement. The issue is shared design responsibility, not a single misconfigured scanner.
👉 Read our full editorial: Trivy etcd exhaustion shows the cost of security visibility at scale