What Is a Flaky Test? Definition & Examples

Expanded Definition

A flaky test is more than an annoying intermittent failure. In non-human identity (NHI) and agentic AI environments, it often signals nondeterminism in the execution path, such as timing variance, race conditions, shared environment contamination, or unstable dependencies. That makes the test result unreliable as evidence of control effectiveness.

In mature delivery pipelines, flaky tests are especially dangerous because they blur the line between a broken control and a broken test harness. A test may pass on one run and fail on the next even when the underlying code, secret, or policy has not changed. This is why teams treating identity controls as software should separate deterministic control checks from environment-sensitive integration checks, and align them with guidance in the NIST Cybersecurity Framework 2.0 where repeatability and measurable assurance matter.

Definitions vary across vendors when flaky tests are discussed in CI/CD, QA, or security automation, but the operational meaning is consistent: a test that cannot reliably prove the control it is supposed to validate. The most common misapplication is treating intermittent failures as proof of a security defect when the real issue is nondeterministic test setup, which occurs when test data, timing, or shared state is not isolated.
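
One way to make this operational definition concrete is to re-run a check several times against unchanged code and inputs, then label it by the mix of outcomes. The following is a minimal sketch under stated assumptions, not a prescribed method: `classify` and `alternating_check` are hypothetical names invented for illustration, and the alternating check merely simulates leaked shared state. A run of consistent passes does not prove a test is deterministic; it only fails to expose flakiness within the chosen number of runs.

```python
import itertools
from collections import Counter
from typing import Callable


def classify(check: Callable[[], bool], runs: int = 10) -> str:
    """Re-run an unchanged check and label it by the mix of outcomes.

    Hypothetical helper: `check` returns True on pass, False on fail.
    An exception raised by the check is counted as a failure.
    """
    outcomes = Counter()
    for _ in range(runs):
        try:
            outcomes["pass" if check() else "fail"] += 1
        except Exception:
            outcomes["fail"] += 1
    if outcomes["fail"] == 0:
        return "stable-pass"
    if outcomes["pass"] == 0:
        return "stable-fail"
    # Mixed outcomes with nothing changed: the result is not usable evidence.
    return "flaky"


# Simulated shared state that leaks between runs (an assumption for the demo):
# the check alternates between passing and failing although nothing changed.
_shared_state = itertools.cycle([True, False])


def alternating_check() -> bool:
    return next(_shared_state)


if __name__ == "__main__":
    print(classify(lambda: True))       # always passes
    print(classify(lambda: False))      # always fails: likely a real defect
    print(classify(alternating_check))  # mixed outcomes
```

A check labelled stable-fail points to a genuine defect worth investigating as a control failure, while a flaky label points first to test setup, data, timing, or shared state.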

Examples and Use Cases

Rigorously suppressing flaky tests often slows pipelines and adds maintenance work, so organisations must weigh faster feedback against higher confidence in release and security validation.

A secret-scanning test fails only when a CI runner loads cached environment variables from a previous job, masking whether the secret is actually exposed in source control.

An access-control test for an AI agent passes locally but fails in shared staging because another job modifies the same service account between setup and assertion.

A policy test intended to verify token rotation intermittently fails because the test fixture depends on wall-clock timing rather than a controlled clock.

A regression test for NHI offboarding becomes unstable when revocation propagation is delayed, making it unclear whether the control or the environment is at fault. See the Ultimate Guide to NHIs for why lifecycle controls must be observable end to end.

An agent-tool authorization test that checks least privilege fails sporadically when external API rate limits change, which can resemble a security gap even though the root cause is test nondeterminism.

In practice, teams use deterministic fixtures, isolated test identities, and explicit dependency control to distinguish a genuine identity failure from a broken test harness. For baseline identity assurance patterns, the NIST Cybersecurity Framework 2.0 remains a useful anchor for repeatable control verification.
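
The token-rotation example above can be made deterministic by injecting a clock instead of reading wall-clock time inside the test. The sketch below illustrates that idea only; `FakeClock`, `RotatingToken`, and the one-hour maximum age are hypothetical and do not describe any particular product's rotation logic.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class FakeClock:
    """Hypothetical controlled clock: time moves only when the test says so."""
    now: float = 0.0

    def __call__(self) -> float:
        return self.now

    def advance(self, seconds: float) -> None:
        self.now += seconds


class RotatingToken:
    """Illustrative token whose rotation depends on an injected clock."""

    def __init__(self, max_age_seconds: float,
                 clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        self._max_age = max_age_seconds
        self._issued_at = clock()
        self.version = 1

    def needs_rotation(self) -> bool:
        return self._clock() - self._issued_at >= self._max_age

    def rotate_if_due(self) -> bool:
        if not self.needs_rotation():
            return False
        self.version += 1
        self._issued_at = self._clock()
        return True


def test_rotation_is_deterministic() -> None:
    clock = FakeClock()
    token = RotatingToken(max_age_seconds=3600, clock=clock)

    clock.advance(3599)                   # one second before the deadline
    assert token.rotate_if_due() is False

    clock.advance(1)                      # exactly at the deadline
    assert token.rotate_if_due() is True
    assert token.version == 2
    assert token.needs_rotation() is False


if __name__ == "__main__":
    test_rotation_is_deterministic()
    print("rotation test passed")
```

Because the test never sleeps and never reads the system clock, a slow or busy CI runner cannot change its outcome, so a failure points at the rotation logic rather than at timing in the environment.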

Why It Matters in NHI Security

Flaky tests undermine trust in NHI security automation in two ways: they create false confidence when risky changes slip through, and they raise false alarms that teach teams to ignore failing checks. Over time, this erodes the value of automated validation for secrets handling, service account governance, and agentic tool access.

The impact is amplified by the scale of NHI exposure. NHI Mgmt Group's Ultimate Guide to NHIs notes that 79% of organisations have experienced secrets leaks, and that 77% of those incidents caused tangible damage. When security tests are flaky, that damage becomes harder to prevent because teams cannot trust the evidence they use to approve deployments or validate remediation.

Flaky tests also interfere with governance because they hide whether a control failure is real, which slows response to findings that should trigger rotation, offboarding, or policy tightening. Organisations typically encounter the operational cost of flaky tests only after a breach, leaked credential, or failed audit, at which point reliable validation can no longer be deferred.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

The OWASP Non-Human Identity Top 10 addresses the attack and risk surface, while NIST CSF 2.0 and NIST AI RMF set the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
NIST CSF 2.0	GV.OV	CSF 2.0 requires measurable, repeatable oversight of security controls.
OWASP Non-Human Identity Top 10	NHI-01	Flaky tests obscure verification of NHI lifecycle and access controls.
NIST AI RMF		AI RMF emphasizes reliable testing and monitoring for trustworthy systems.

Make NHI test results repeatable enough to support trustworthy oversight and control validation.
