Agentic AI for flaky tests: what Kong’s workflow reveals

By NHI Mgmt Group Editorial Team | Published 2026-06-10 | Domain: Agentic AI & NHIs | Source: Kong

TL;DR: According to Kong, an agentic workflow fixed 12 of the 15 flakiest tests in its Gateway test suite over a week and a half. Work that can take an engineer a full day per pair of failures became a 4 to 5 hour loop with no human intervention. The deeper lesson is that runtime autonomy changes how teams should think about debugging, verification, and context management, not just speed.

At a glance

What this is: Kong describes an agentic AI workflow that investigates flaky tests, proposes fixes, and verifies them through repeated reruns to stabilise a large Gateway test suite.

Why it matters: Autonomous debugging workflows change how engineering teams should govern tool access, verification loops, and accountability across AI, NHI, and human-controlled development pipelines.

By the numbers:

Over a week and a half, the workflow fixed 12 of the 15 flakiest tests on the dashboard.
One full test run takes about 23.5 hours on a single machine across 34,000 test cases.


Context

Agentic AI in software engineering is moving from code generation into task execution, where a system can inspect logs, form a hypothesis, write a fix, and run verification without human approval at each step. That matters for identity governance because the control question is no longer just who can commit code, but what runtime authority the agent has while it is debugging, branching, and pushing changes.

Flaky tests are a classic reliability problem, but they also expose a governance problem: the work requires repeated access to repositories, logs, CI results, and branch creation. As engineering teams let agents take on more of that loop, the relevant identity boundary shifts from a person clicking through a workflow to a machine identity acting inside a bounded development system.

The article is a good example of a controlled engineering use case rather than a fully autonomous production deployment. The agent operates inside a defined review and verification loop, so the operational lesson is about constrained runtime delegation, not handing general authority to AI.

Key questions

Q: How should teams govern agentic AI workflows that can branch and commit code?

A: Treat them as governed runtime identities, not as ordinary automation. Scope repository, CI, and branch permissions to the smallest viable task, separate analysis from verification roles, and log every decision point. The key control is not speed, but whether the agent can only perform the actions it was explicitly authorised to perform.

Q: Why do agentic debugging workflows create new IAM risk even when they stay inside CI?

A: Because they combine log access, code access, branch creation, and repeated verification into one delegated execution path. That chain can cross multiple trust boundaries without human pacing, so the risk is privilege accumulation inside the workflow rather than external intrusion. Teams need to govern the chain, not just the endpoint.

Q: What breaks when an AI agent keeps too much context across troubleshooting runs?

A: It becomes easier for stale hypotheses to shape new actions, which increases false confidence and widens the chance of repeated misdiagnosis. Short, task-specific context forces the next run to re-evaluate evidence rather than inherit old assumptions. That improves both reliability and auditability in long-running agentic work.

Q: How do human reviewers stay accountable when an AI agent prepares the fix?

A: Human reviewers should approve the change after deterministic verification, not after a model claims success. The agent can accelerate diagnosis and remediation, but the reviewer must own the final merge decision, because accountability belongs to the change owner, not to the runtime workflow.

Technical breakdown

How agentic test repair loops combine log access, branching, and verification

The workflow Kong describes is an identify-fix-verify loop. A top-level orchestrator reads a flaky test export, downloads recent CI logs, and spawns a specialist subagent to analyse the failure, propose a patch, and create a branch. A second verifier agent then reruns the test until the system can distinguish a real fix from noisy surrounding failures. The architecture is important because it turns debugging into a multi-step runtime process with separate identities for analysis and verification, rather than a single script that blindly edits code.

Practical implication: treat each agent role as a distinct workload identity with narrowly scoped repository and CI permissions.
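
The role separation described above can be pictured as a deny-by-default permission map. This is a minimal sketch, not Kong's implementation: the role names follow the article, but the action strings, the `PERMISSIONS` table, and the `is_allowed` helper are hypothetical and would need mapping onto a real repository host and CI provider.

```python
from enum import Enum


class Role(Enum):
    ORCHESTRATOR = "orchestrator"
    FLAKE_FIXER = "flake_fixer"
    VERIFIER = "verifier"


# Hypothetical action names, chosen for illustration only.
# Assumption: the article does not say which role opens the pull request;
# it is given to the orchestrator here.
PERMISSIONS: dict[Role, frozenset[str]] = {
    Role.ORCHESTRATOR: frozenset(
        {"ci:read_logs", "agent:spawn", "repo:open_pull_request"}
    ),
    Role.FLAKE_FIXER: frozenset(
        {"repo:read", "repo:create_branch", "repo:push_branch"}
    ),
    Role.VERIFIER: frozenset(
        {"repo:read", "ci:trigger_run", "ci:read_results"}
    ),
}


def is_allowed(role: Role, action: str) -> bool:
    """Deny by default: an action is allowed only if it is listed for the role."""
    return action in PERMISSIONS.get(role, frozenset())


if __name__ == "__main__":
    print(is_allowed(Role.FLAKE_FIXER, "repo:create_branch"))  # True
    print(is_allowed(Role.VERIFIER, "repo:push_branch"))       # False
```

In this sketch no role holds a merge permission at all, which keeps the final merge with a human reviewer.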

Why context management matters in long-running AI debugging tasks

The article shows that context size is not just a model-performance issue. The flake-fixer agent only keeps the information needed for the current investigation, then hands a summary to the orchestrator before the next attempt starts. That pattern reduces token cost, avoids dragging stale hypotheses into a new run, and helps the agent stay grounded in logs instead of guessing. In identity terms, this is a runtime containment pattern: the task is decomposed so the agent does not accumulate unnecessary decision history across iterations.

Practical implication: design agent workflows so each run has a short-lived context and a clear handoff point.
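
A short-lived context with an explicit handoff might look like the sketch below. It is an illustration under stated assumptions rather than a description of Kong's code: `HandoffSummary`, `AttemptContext`, `start_attempt`, and the `keep_last` parameter are hypothetical names, and the summary fields are a guess at what an orchestrator would need.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class HandoffSummary:
    """The only state that survives between attempts (hypothetical shape)."""
    test_name: str
    attempt: int
    hypothesis: str
    outcome: str  # for example "verified", "rejected", or "inconclusive"


@dataclass
class AttemptContext:
    """Working context for one investigation; discarded when the attempt ends."""
    test_name: str
    attempt: int
    prior_summaries: tuple[HandoffSummary, ...]
    log_excerpts: list[str] = field(default_factory=list)
    scratch_notes: list[str] = field(default_factory=list)

    def close(self, hypothesis: str, outcome: str) -> HandoffSummary:
        # Log excerpts and scratch notes are deliberately not carried forward.
        return HandoffSummary(self.test_name, self.attempt, hypothesis, outcome)


def start_attempt(
    test_name: str, history: list[HandoffSummary], keep_last: int = 3
) -> AttemptContext:
    """Build a fresh context that sees only the most recent summaries."""
    recent = tuple(history[-keep_last:]) if keep_last > 0 else ()
    return AttemptContext(
        test_name=test_name, attempt=len(history) + 1, prior_summaries=recent
    )


if __name__ == "__main__":
    history: list[HandoffSummary] = []
    first = start_attempt("example_flaky_test", history)
    first.log_excerpts.append("(CI log excerpt would go here)")
    history.append(first.close("race on a shared port", "rejected"))

    second = start_attempt("example_flaky_test", history)
    print(second.attempt, len(second.prior_summaries), second.log_excerpts)
```

The design choice is that the handoff record doubles as an audit artefact: a reviewer can read the sequence of summaries to see what each attempt believed and how it ended.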

What makes verification loops different from ordinary automation

A conventional CI job runs a fixed script. An agentic verifier makes judgment calls while parsing messy outputs, deciding whether the targeted test actually passed, and repeating until a streak of clean runs appears. That means the system is not just automating execution; it is making runtime decisions about evidence quality and stopping conditions. For IAM and NHI governance, that raises the bar for auditability because the important question becomes whether the agent can explain why it considered the fix valid, not merely whether the pipeline completed.

Practical implication: require traceable verification criteria whenever an agent can decide when to stop or retry.
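
Traceable verification criteria can be made concrete with an explicit streak rule and a hard cap on reruns. The sketch below is a simplified assumption, not Kong's verifier: it presumes a hypothetical `run_test` callable that has already reduced messy CI output to a pass/fail for the targeted test, and the default values for `required_streak` and `max_runs` are illustrative.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class VerificationResult:
    verified: bool
    runs: tuple[bool, ...]  # retained evidence: outcome of every rerun, in order
    reason: str


def verify_fix(
    run_test: Callable[[], bool],
    required_streak: int = 10,
    max_runs: int = 30,
) -> VerificationResult:
    """Rerun until `required_streak` consecutive passes, or stop at `max_runs`."""
    if required_streak < 1 or max_runs < required_streak:
        raise ValueError("need 1 <= required_streak <= max_runs")

    runs: list[bool] = []
    streak = 0
    for _ in range(max_runs):
        passed = run_test()
        runs.append(passed)
        streak = streak + 1 if passed else 0  # any failure resets the streak
        if streak >= required_streak:
            return VerificationResult(
                True, tuple(runs), f"{streak} consecutive passes"
            )
    return VerificationResult(
        False, tuple(runs), f"no streak of {required_streak} within {max_runs} runs"
    )


if __name__ == "__main__":
    # Scripted outcomes stand in for real CI reruns.
    outcomes = iter([True, False, True, True, True])
    result = verify_fix(lambda: next(outcomes), required_streak=3, max_runs=5)
    print(result.verified, result.reason)  # True 3 consecutive passes
```

Because the result records every rerun and the reason for stopping, a reviewer can check the stopping decision against the documented rule instead of trusting the agent's summary.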

Threat narrative

Attacker objective: The workflow itself is not malicious, but it has the structural property an attacker looks for in an autonomy-heavy environment: legitimate access that can be used to perform chained actions faster than a human can review them.

Entry occurs through legitimate CI and repository access granted to the agentic workflow, not through external compromise.
Escalation happens when the workflow gains the ability to branch, commit, and rerun verification without human intervention between steps.
Impact is improved test stability, but also a wider delegated execution surface inside engineering systems that must be governed.
The main objective is to compress flaky-test remediation from a human bottleneck into an autonomous debugging and verification loop.

Moltbook AI agent keys breach — Moltbook breach exposed 1.5M AI agent keys.
AI LLM hijack breach — attackers used stolen AWS access keys to hijack Anthropic LLM models on Bedrock.


NHI Mgmt Group analysis

Runtime delegation is now an identity problem, not just a developer productivity problem. Kong’s workflow shows that an agent can inspect evidence, decide what to try next, and move from diagnosis to branch creation without a human in the loop. That makes the agent a runtime actor whose access, timing, and tool use must be governed as a distinct non-human identity. The practical conclusion is that engineering teams need to classify agentic debugging paths as governed execution, not simple automation.

Least privilege for agentic workflows becomes harder the more a system can chain decisions. The same workflow that can read logs and propose fixes can also create branches, push commits, and trigger repeated verification runs. That expands the trust boundary across repository, CI, and review systems. In OWASP Agentic AI terms, the risk is not only tool access but the ability to combine tools in sequence. Practitioners should re-evaluate whether each step in the loop is separately authorised.

Context compression is a governance control, not only a cost optimisation. By forcing each subagent to work from a narrow slice of evidence and a fresh context, Kong reduces the chance that stale reasoning drives later actions. That is a useful pattern for any AI or NHI workflow where one run should not inherit broad memory from the last. The lesson is that bounded context helps contain both error propagation and privilege drift.

Autonomous debugging only works when the verification boundary is explicit. The system does not declare success because a model feels confident. It waits for repeated evidence from reruns and only then opens a pull request for human review. That is the right governance direction for agentic work: machine judgment can accelerate investigation, but final accountability still has to be anchored in deterministic verification and human approval.

From our research:
80% of organisations report their AI agents have already performed actions beyond their intended scope, including accessing unauthorised systems, inappropriately sharing sensitive data, and revealing access credentials, according to the AI Agents: The New Attack Surface report.
Only 52% of companies can track and audit the data their AI agents access, leaving 48% with a complete blind spot for compliance and breach investigation, according to the same report.
For a broader framework view, the OWASP NHI Top 10 is a useful next stop for teams designing agentic access controls and verification boundaries.

What this signals

Flaky-test remediation is becoming a proving ground for agent governance. Once an agent can inspect logs, branch code, and retry verification on its own, the real question is whether the organisation can prove the agent stayed inside the task boundary. Kong’s workflow is a useful pattern because it shows how a narrow delegated job can remain reviewable, but only if the permissions model and audit trail are designed around the agent's runtime behaviour.

AI agent security programmes should start by defining where context ends. Long-running agents fail governance when they carry too much prior reasoning into the next action, because that creates hidden state that humans cannot easily inspect. The strongest operating model is a short-lived task context paired with explicit handoff logs, which makes it easier to review what the agent knew, what it changed, and why it stopped.

Identity teams should expect more engineering workflows to look like controlled machine identities. That shifts programme priorities toward scoped credentials, approval gates, and evidence capture for non-human actors that can act continuously inside CI and developer tooling. If your governance model still assumes a person behind every change, this class of workflow will expose the gap quickly.

For practitioners

Define agent-specific repository permissions: Give debugging agents only the repository, branch, and CI permissions they need for a single task class. Separate analysis, fix, and verification identities so one agent cannot silently expand its own access across the workflow.
Require explicit stopping conditions for verification loops: Document what counts as success, how many reruns are required, and what evidence the verifier must retain before a pull request is opened. This prevents the agent from deciding that noisy output is good enough.
Limit context retention between runs: Pass forward only a short summary of prior attempts, not the full investigative history. That reduces stale reasoning, lowers token cost, and keeps each run focused on current evidence rather than earlier guesses.
Separate human review from machine diagnosis: Allow the agent to diagnose and propose, but require a human to approve merged changes even after the verification streak passes. That keeps accountability tied to the change, not to the model's confidence.
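
For the question of how many reruns to require, one rough way to pick a number is to ask how likely an unfixed test is to pass that many times in a row by chance. The sketch below is a back-of-envelope model and not something the Kong article describes: it assumes each rerun fails independently with a constant probability `failure_rate`, which real flaky tests often violate, and `required_clean_runs` and `max_false_pass` are hypothetical names. Under that assumption, an unfixed test passes n times in a row with probability (1 - failure_rate)^n, so the smallest acceptable n is the ceiling of ln(max_false_pass) / ln(1 - failure_rate).

```python
import math


def required_clean_runs(failure_rate: float, max_false_pass: float = 0.05) -> int:
    """Smallest n such that (1 - failure_rate) ** n <= max_false_pass.

    Assumes independent reruns with a constant failure probability.
    """
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    if not 0.0 < max_false_pass < 1.0:
        raise ValueError("max_false_pass must be strictly between 0 and 1")
    return math.ceil(math.log(max_false_pass) / math.log(1.0 - failure_rate))


if __name__ == "__main__":
    # A test that fails 1 run in 10 needs a long streak before a pass is convincing.
    print(required_clean_runs(0.10))  # 29
    # A test that fails half the time is ruled out much sooner.
    print(required_clean_runs(0.50))  # 5
```

The practical point is that rarer flakes need longer streaks, so a single fixed rerun count for every test either wastes CI time or accepts weak evidence.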

Key takeaways

Agentic test repair turns debugging into a governed runtime identity problem because the system can inspect, decide, and act across multiple tools without human pacing.
Kong’s results show both scale and speed, with 12 of the 15 flakiest tests fixed over a week and a half and individual fixes completed in 4 to 5 hours.
The practical lesson is to bound permissions, verification, and context so that autonomous remediation stays auditable and reviewable.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

The OWASP Agentic AI Top 10 addresses the attack and risk surface, while NIST AI RMF and NIST CSF 2.0 set the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
OWASP Agentic AI Top 10		Agentic debugging uses tool-chaining and autonomous verification.
NIST AI RMF		AI workflows need governance, traceability, and accountability for runtime decisions.
NIST CSF 2.0	PR.AA-05	Agent workflows need least-privilege access across CI and repositories.

Scope agent permissions by tool, action, and stop condition before allowing autonomous debug loops.

Key terms

Agentic Workflow: An agentic workflow is a process in which software can decide what to do next, choose tools, and continue execution without a human approving each step. In identity terms, it is a runtime actor that needs scoped permissions, traceability, and clear stop conditions.
Flaky Test: A flaky test is a test that fails intermittently without a stable code change explaining the failure. The underlying problem is often nondeterminism in timing, ordering, or shared state, which makes diagnosis expensive and verification difficult.
Verification Loop: A verification loop is the repeated rerun and inspection cycle used to prove that a fix actually resolved an intermittent failure. In agentic systems, the loop matters because success must be evidenced by repeatable outcomes, not by model confidence alone.
Runtime Delegation: Runtime delegation is the handing of live execution authority to a non-human system while it is already operating, rather than only at provisioning time. It increases governance complexity because access, timing, and tool use can change during the task itself.


This post draws on content published by Kong: How We Used Agentic AI to Fix Kong Gateway's Flakiest Tests.
