
What breaks when vulnerability scan data is stored directly in etcd at scale?

When etcd exceeds its storage quota, it raises a space alarm and rejects further writes, which leaves the Kubernetes API server effectively read-only. Deployments, scaling actions, and deletions can then fail even though the underlying cause is only security telemetry growth. The breaking point is not visibility itself, but uncontrolled report volume and revision churn.

Why This Matters for Security Teams

Storing vulnerability scan results directly in etcd turns a useful security feed into infrastructure risk. etcd is the Kubernetes control plane data store, so growth in scan volume, retention, and revision churn competes for the same storage that backs deployments, scaling, and deletions. Once that store comes under pressure, the symptom is not just slower scanning. Control-plane writes can fail, and the cluster can become operationally constrained.

This is a classic non-human identity (NHI) failure mode because the problem is not the data itself, but the identity-backed workload creating it at scale. Security telemetry from service accounts, scanners, and controllers often expands faster than teams expect, especially when ownership is diffuse. NHI Mgmt Group notes that only 5.7% of organisations have full visibility into their service accounts in the Ultimate Guide to NHIs – Key Research and Survey Results, which helps explain why workload-generated data often grows without a governance boundary.

For broader threat context, CISA cyber threat advisories repeatedly show that operational fragility is a security issue, not just an availability issue. In practice, many security teams encounter etcd pressure only after ordinary cluster writes have already started failing, rather than through intentional capacity planning.

How It Works in Practice

etcd stores Kubernetes state as key-value data with revision history. If vulnerability scan output is written into that store directly, each report, update, or re-ingest can add more objects and more revisions. That creates a compounding storage pattern: larger payloads, more churn, and more compaction pressure. The control plane then has to process security telemetry alongside the state it needs to keep the cluster healthy.
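
The compounding pattern can be made concrete with a rough estimate. The sketch below is a simplified model, not a description of etcd internals: it assumes every update keeps a full copy of the report until the next compaction, and the function names, report sizes, and the 2 GiB quota are illustrative assumptions rather than measurements.

```python
# Back-of-envelope model of etcd growth from scan reports (illustrative only).
# Assumption: each update retains a full copy of the value until compaction.

GIB = 1024 ** 3


def estimate_scan_bytes(report_count: int,
                        avg_report_bytes: int,
                        updates_before_compaction: int) -> int:
    """Approximate bytes held by live reports plus their uncompacted revisions."""
    revisions_per_report = 1 + updates_before_compaction
    return report_count * avg_report_bytes * revisions_per_report


def quota_fraction(used_bytes: int, quota_bytes: int) -> float:
    """Share of the backend quota consumed by the estimate."""
    return used_bytes / quota_bytes


if __name__ == "__main__":
    # Hypothetical inputs: 5,000 reports of about 200 KiB each, rewritten
    # 3 times between compactions, against an assumed 2 GiB quota.
    used = estimate_scan_bytes(5_000, 200 * 1024, 3)
    print(f"{used / GIB:.2f} GiB, {quota_fraction(used, 2 * GIB):.0%} of quota")
```

With these hypothetical inputs the estimate comes to roughly 3.8 GiB, close to double the assumed quota, even though each individual report is small. Real usage depends on compaction and defragmentation behaviour, so the figure is best read as an order-of-magnitude check.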

Practitioners usually run into trouble when scan data is treated like durable application state instead of short-lived operational data. Good practice is to keep only the minimum status needed for reconciliation in etcd, then push full findings to an external system built for retention, search, and analytics. The Top 10 NHI Issues research is relevant here because unmanaged workload behavior often creates hidden blast radius before anyone notices the control plane is under strain.

  • Store identifiers, summary status, and pointers in etcd, not full finding payloads.
  • Use an external datastore or security platform for raw scan history and long-term retention.
  • Set size limits, TTLs, and compaction policies for any controller-written objects.
  • Throttle ingestion so repeated scans do not create revision storms.
  • Monitor etcd object count, db size, and apiserver write latency together, not separately.
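
As a minimal sketch of the first three points, a controller could reduce each report to a small summary before writing anything to the cluster. The report shape (`image`, `scanned_at`, `findings`, `severity`), the `artifact_url` pointer, and the 4 KiB cap are all hypothetical choices for illustration, not the format of any particular scanner.

```python
import json
from collections import Counter

MAX_SUMMARY_BYTES = 4 * 1024  # hypothetical cap for a controller-written object


def summarize_report(report: dict, artifact_url: str) -> dict:
    """Reduce a full scan report to identifiers, counts, and a pointer."""
    counts = Counter(finding["severity"] for finding in report["findings"])
    return {
        "image": report["image"],
        "scanned_at": report["scanned_at"],
        "severity_counts": dict(sorted(counts.items())),
        "artifact_url": artifact_url,  # full findings live outside etcd
    }


def encoded_size(obj: dict) -> int:
    """Size of the object as compact UTF-8 JSON."""
    return len(json.dumps(obj, separators=(",", ":")).encode("utf-8"))


def check_size(summary: dict, limit: int = MAX_SUMMARY_BYTES) -> dict:
    """Reject summaries that exceed the cap instead of writing them."""
    size = encoded_size(summary)
    if size > limit:
        raise ValueError(f"summary is {size} bytes, limit is {limit}")
    return summary
```

The summary stays roughly constant in size no matter how many findings a scan produces, which is the property that keeps the control-plane footprint predictable. The full report would be uploaded to the external store first, and only the resulting pointer written to the cluster.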

Current guidance suggests treating scan telemetry as an externalized workload artifact rather than a control-plane resident dataset, but there is no universal standard for exactly where that boundary should sit. These controls tend to break down in high-churn clusters with frequent rescans, because repeated writes and retention defaults can outpace compaction and fill the store faster than operators expect.
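
One way to limit revision churn from frequent rescans is to skip writes whose content has not changed and to enforce a minimum interval between writes for the same object. The sketch below illustrates that idea under stated assumptions: `WriteThrottle` is a hypothetical helper, not part of any Kubernetes client library, and the caller supplies the clock value.

```python
import hashlib
import json


class WriteThrottle:
    """Decide whether a summary update is worth a write to the API server."""

    def __init__(self, min_interval_seconds: float):
        self.min_interval = min_interval_seconds
        self._last = {}  # key -> (digest, time of last accepted write)

    @staticmethod
    def _digest(summary: dict) -> str:
        # Sorted keys make the digest independent of dict ordering.
        payload = json.dumps(summary, sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def should_write(self, key: str, summary: dict, now: float) -> bool:
        digest = self._digest(summary)
        previous = self._last.get(key)
        if previous is not None:
            last_digest, last_time = previous
            if digest == last_digest:
                return False  # unchanged result: no new revision needed
            if now - last_time < self.min_interval:
                return False  # changed, but too soon after the last write
        self._last[key] = (digest, now)
        return True
```

A changed result that arrives inside the interval is not recorded by this sketch, so a real controller would need to requeue it and retry after the interval elapses. The design trades some freshness of the in-cluster summary for fewer revisions.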

Common Variations and Edge Cases

Tighter retention often increases operational overhead, requiring organisations to balance forensic depth against control-plane stability. That tradeoff becomes sharper when compliance teams want long-lived evidence and platform teams want a small, fast etcd footprint.

One edge case is a small cluster that looks safe in testing but fails under production scan volume, especially if multiple scanners, namespaces, or controllers all write similar results. Another is a multi-tenant environment where each tenant generates separate findings objects, multiplying revisions even when the total finding count seems manageable. The operational answer is often to keep etcd as the source of truth for workflow state only, while sending detailed scan artifacts to a purpose-built store with access controls and lifecycle management.

This is also where governance matters. If the scanners are service accounts or autonomous pipelines, their write patterns should be reviewed like any other NHI behavior, not assumed to be harmless because the content is “security data.” Best practice is evolving, but the principle is stable: control-plane storage should remain small, predictable, and reversible. The Ultimate Guide to NHIs – Why NHI Security Matters Now is a useful reminder that unmanaged non-human activity is a recurring source of enterprise risk.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Non-Human Identity Top 10 and CSA MAESTRO address the attack and risk surface, while NIST CSF 2.0 sets the governance and control requirements practitioners need to meet.

Framework | Control / Reference | Relevance
OWASP Non-Human Identity Top 10 | NHI-03 | Direct storage of scan output can create uncontrolled NHI data growth and retention risk.
NIST CSF 2.0 | PR.DS-01 | etcd data overload affects protection of data in operational systems and service availability.
CSA MAESTRO | (none given) | Agentic or automated scanners can overwhelm shared state stores without workload-aware guardrails.

Limit NHI-generated telemetry in control-plane stores and externalize bulky artifacts with clear retention rules.