A flaky test is a test that fails intermittently without a stable code change explaining the failure. The underlying problem is often nondeterminism in timing, ordering, or shared state, which makes diagnosis expensive and verification difficult.
Expanded Definition
A flaky test is more than an annoying intermittent failure. In NHI and agentic AI environments, it often signals nondeterminism in the execution path, such as timing variance, race conditions, shared environment contamination, or unstable dependencies. That makes the test result unreliable as evidence of control effectiveness.
In mature delivery pipelines, flaky tests are especially dangerous because they blur the line between a broken control and a broken test harness. A test may pass on one run and fail on the next even when the underlying code, secret, or policy has not changed. This is why teams treating identity controls as software should separate deterministic control checks from environment-sensitive integration checks, and align them with guidance in the NIST Cybersecurity Framework 2.0 where repeatability and measurable assurance matter.
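One way to picture that separation is to tag each check by kind and route failures differently. The sketch below is illustrative only: the `Check` type, the two kind labels, and the "blocking" versus "needs_triage" routing are assumptions made for this example, not a prescribed NHIMG or NIST mechanism.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical labels for the two families of checks described above.
KINDS = {"deterministic", "environment"}


@dataclass
class Check:
    name: str
    kind: str                  # "deterministic" or "environment"
    run: Callable[[], bool]    # returns True when the check passes

    def __post_init__(self) -> None:
        if self.kind not in KINDS:
            raise ValueError(f"unknown check kind: {self.kind!r}")


def evaluate(checks: List[Check]) -> Dict[str, List[str]]:
    """Route failures: deterministic ones block, environment ones go to triage."""
    report: Dict[str, List[str]] = {"blocking": [], "needs_triage": []}
    for check in checks:
        if check.run():
            continue
        key = "blocking" if check.kind == "deterministic" else "needs_triage"
        report[key].append(check.name)
    return report
```

Under this assumed policy, a failing deterministic check is treated as evidence about the control itself, while a failing environment-sensitive check is investigated before anyone concludes the control is broken.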
Vendors define flaky tests differently across CI/CD, QA, and security automation, but the operational meaning is consistent: a flaky test cannot reliably prove the control it is supposed to validate. The most common misapplication is treating intermittent failures as proof of a security defect when the real cause is nondeterministic test setup, where test data, timing, or shared state is not isolated.
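The definition can be made operational by rerunning a test against unchanged code and looking for mixed outcomes. The following is a minimal sketch; `classify_test` and its labels are hypothetical names chosen for this example, and the test is assumed to be a callable that returns `True` on pass.

```python
from collections import Counter
from typing import Callable


def classify_test(run_test: Callable[[], bool], runs: int = 20) -> str:
    """Run the same test repeatedly with no code change and classify the result."""
    if runs < 2:
        raise ValueError("need at least two runs to observe intermittency")
    outcomes = Counter(bool(run_test()) for _ in range(runs))
    if outcomes[True] == runs:
        return "stable-pass"
    if outcomes[False] == runs:
        return "stable-fail"
    # Mixed outcomes on identical code are the signature of a flaky test.
    return "flaky"
```

A "stable-pass" result here is only evidence, not proof: a test that flakes rarely can still pass every one of a small number of runs.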
Examples and Use Cases
Suppressing flakiness rigorously often slows pipelines and adds maintenance, so organisations must weigh faster feedback against higher confidence in release and security validation.
- A secret-scanning test fails only when a CI runner loads cached environment variables from a previous job, masking whether the secret is actually exposed in source control.
- An access-control test for an AI agent passes locally but fails in shared staging because another job modifies the same service account between setup and assertion.
- A policy test intended to verify token rotation intermittently fails because the test fixture depends on wall-clock timing rather than a controlled clock.
- A regression test for NHI offboarding becomes unstable when revocation propagation is delayed, making it unclear whether the control or the environment is at fault. See the Ultimate Guide to NHIs for why lifecycle controls must be observable end to end.
- An agent-tool authorization test that checks least privilege fails sporadically when external API rate limits change, which can resemble a security gap even though the root cause is test nondeterminism.
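The token-rotation case in the list above can be made deterministic by injecting a clock rather than reading wall-clock time. This is a sketch under stated assumptions: `FakeClock`, `Token`, and `needs_rotation` are hypothetical names invented for illustration and do not belong to any particular library.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable


class FakeClock:
    """A clock the test controls, so elapsed time never depends on the runner."""

    def __init__(self, start: datetime) -> None:
        self._now = start

    def now(self) -> datetime:
        return self._now

    def advance(self, delta: timedelta) -> None:
        self._now += delta


@dataclass
class Token:
    issued_at: datetime
    max_age: timedelta


def needs_rotation(token: Token, now: Callable[[], datetime]) -> bool:
    """The policy under test: rotate once the token reaches its maximum age."""
    return now() - token.issued_at >= token.max_age


clock = FakeClock(datetime(2026, 1, 1, tzinfo=timezone.utc))
token = Token(issued_at=clock.now(), max_age=timedelta(hours=24))
assert not needs_rotation(token, clock.now)   # fresh token
clock.advance(timedelta(hours=24))
assert needs_rotation(token, clock.now)       # exactly at the boundary
```

Because the test advances time explicitly, the boundary condition is checked the same way on every run, regardless of runner load or scheduling delays.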
In practice, teams use deterministic fixtures, isolated test identities, and explicit dependency control to distinguish a genuine identity failure from a broken test harness. For baseline identity assurance patterns, the NIST Cybersecurity Framework 2.0 remains a useful anchor for repeatable control verification.
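As an illustration of isolated test identities, the sketch below gives each test its own uniquely named service account and removes it afterwards, so no other job can modify it between setup and assertion. `InMemoryDirectory` is a hypothetical stand-in for a real identity provider, and `isolated_identity` is an assumed helper name.

```python
import uuid
from contextlib import contextmanager
from typing import Dict, Iterator, Set


class InMemoryDirectory:
    """Hypothetical stand-in for an identity provider, used only for this sketch."""

    def __init__(self) -> None:
        self.accounts: Dict[str, Set[str]] = {}

    def create(self, name: str, scopes: Set[str]) -> None:
        if name in self.accounts:
            raise ValueError(f"account already exists: {name}")
        self.accounts[name] = set(scopes)

    def delete(self, name: str) -> None:
        self.accounts.pop(name, None)


@contextmanager
def isolated_identity(directory: InMemoryDirectory, scopes: Set[str]) -> Iterator[str]:
    """Create a uniquely named service account for one test, then clean it up."""
    name = f"test-sa-{uuid.uuid4().hex}"
    directory.create(name, scopes)
    try:
        yield name
    finally:
        # Cleanup runs even if the test body raises, so no state leaks to later tests.
        directory.delete(name)
```

A test would wrap its setup and assertions in `with isolated_identity(directory, {"read"}) as account:` and assert only against `account`, never against a shared, long-lived identity.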
Why It Matters in NHI Security
Flaky tests undermine trust in NHI security automation in two ways: a flaky check that happens to pass creates false confidence and lets risky changes slip through, while repeated false alarms teach teams to ignore failing checks. Over time, this erodes the value of automated validation for secrets handling, service account governance, and agentic tool access.
The impact is amplified by the scale of NHI exposure. NHI Mgmt Group's Ultimate Guide to NHIs reports that 79% of organisations have experienced secrets leaks, and that 77% of those incidents caused tangible damage. When security tests are flaky, that damage becomes harder to prevent because teams cannot trust the evidence they use to approve deployments or validate remediation.
Flaky tests also interfere with governance because they hide whether a control failure is real, which slows response to findings that should trigger rotation, offboarding, or policy tightening. Organisations typically confront the operational cost of flaky tests only after a breach, leaked credential, or failed audit, at which point reliable validation can no longer be deferred.
Standards & Framework Alignment
This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.
The OWASP Non-Human Identity Top 10 addresses the attack and risk surface, while NIST CSF 2.0 and NIST AI RMF set the governance and control expectations practitioners need to meet.
| Framework | Control / Reference | Relevance |
|---|---|---|
| NIST CSF 2.0 | GV.OV | CSF 2.0 calls for measurable, repeatable oversight of security controls. |
| OWASP Non-Human Identity Top 10 | NHI-01 | Flaky tests obscure verification of NHI lifecycle and access controls. |
| NIST AI RMF | | AI RMF emphasizes reliable testing and monitoring for trustworthy systems. |
Make NHI test results repeatable enough to support trustworthy oversight and control validation.
Reviewed and updated by the NHIMG editorial team on June 23, 2026.
NHI Mgmt Group — the #1 independent authority on Non-Human Identity, IAM, and Agentic AI security. nhimg.org