TL;DR: AI agents can accelerate analysis and bounded refactoring in a multi-million-line Go monolith, but they fail quickly when sequencing, invariants, or context are incomplete, according to 1Password. The deeper lesson is that production governance depends on explicit constraints, because agentic execution still breaks when intent must be inferred at runtime.
At a glance
What this is: An analysis of how AI agents behaved during a large-scale codebase refactoring. The key finding is that they were useful for deterministic analysis but unreliable when sequencing and invariants were underspecified.
Why it matters: IAM, NHI, and emerging autonomous governance programmes all depend on knowing when machine actors can be trusted to act within bounds and when human-defined controls must stop them.
By the numbers:
- The migration required updating more than 3,000 call sites across production and test code.
- 20-30% improvement.
👉 Read 1Password's analysis of AI agentic refactoring in production systems
Context
AI agentic refactoring changes the shape of production engineering because the actor is no longer just generating code; it is also sequencing changes, inferring dependencies, and deciding when a task is complete. That makes the governance problem an identity problem as much as an engineering one, because the actor is operating with machine-scale reach inside live systems.
The weak point is not raw generation quality. The weak point is the assumption that a system can be safely decomposed, migrated, or rewritten without an explicit model of boundaries, dependency order, and escalation paths. That is familiar territory for NHI governance, but the arrival of agentic execution raises the bar on how precisely those boundaries must be defined.
Key questions
Q: How should teams govern AI agents that refactor production systems?
A: Teams should govern agentic refactoring as a constrained execution problem, not a free-form coding problem. The agent should work from explicit manifests, bounded file access, and clear stop conditions. If the change depends on sequencing, live state, or irreversible transitions, human review must control the next step before the agent proceeds.
Q: Why do AI agents struggle with large production migrations?
A: They struggle because large migrations depend on ordering, invariants, and state awareness, while agents tend to optimise for local completion. A change that looks correct in isolation can still break schema evolution, shared ownership, or deployment sequencing. The failure is usually not code quality but hidden dependencies that nobody made explicit.
Q: What breaks when agents infer missing context during execution?
A: When agents infer missing context, the workflow starts to operate on assumptions instead of validated facts. That can propagate a wrong identifier, an incorrect ownership model, or an unsafe sequence through the entire task. In production systems, speculation is not a small error. It is a control failure.
Q: Who is responsible when agentic tooling makes a bad migration decision?
A: Accountability remains with the team that defined the workflow and granted the agent execution authority. A machine actor can propose or execute changes, but it cannot own the governance model around sequencing, validation, and rollback. That responsibility sits with the programme that allowed the actor to act.
Technical breakdown
Why deterministic artefacts matter more than model output
The article distinguishes between using agents to reason and using agents to produce stable operational artefacts. In practice, the useful pattern was to let agents help build analyzers, manifests, and plans, then treat those outputs as the real control surface. That reduces the impact of model variance because engineers are no longer arguing with a prediction. They are reviewing a reproducible structure that can be tested, versioned, and constrained. This is especially important when code touches data ownership, security boundaries, or request paths. The more sensitive the system, the more the output must become deterministic before execution starts.
Practical implication: require agents to generate reviewable artefacts before they are allowed to change production code.
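As a rough illustration of what a deterministic artefact can look like, the sketch below uses Go's standard `go/parser` and `go/ast` packages to list every call to a target function and emit a sorted manifest. It is not 1Password's analyzer. The `legacy.Lookup` target, the `CallSite` type, and the sample source are hypothetical, and matching on the package identifier name is a simplification that ignores import aliases.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"sort"
)

// CallSite is one manifest entry: a location where the target function is called.
type CallSite struct {
	File string
	Line int
}

// sample is hypothetical source standing in for one file of the monolith.
const sample = `package demo

import "legacy"

func a() { legacy.Lookup("x") }

func b() {
	legacy.Lookup("y")
	legacy.Other()
}
`

// FindCallSites parses src and returns every call of the form pkg.fn(...),
// sorted by file and line so repeated runs produce identical manifests.
// Assumption: the package is referenced by its plain name (no import alias).
func FindCallSites(filename, src, pkg, fn string) ([]CallSite, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, filename, src, 0)
	if err != nil {
		return nil, err
	}
	var sites []CallSite
	ast.Inspect(file, func(n ast.Node) bool {
		call, ok := n.(*ast.CallExpr)
		if !ok {
			return true
		}
		sel, ok := call.Fun.(*ast.SelectorExpr)
		if !ok {
			return true
		}
		if id, ok := sel.X.(*ast.Ident); ok && id.Name == pkg && sel.Sel.Name == fn {
			pos := fset.Position(call.Pos())
			sites = append(sites, CallSite{File: pos.Filename, Line: pos.Line})
		}
		return true
	})
	sort.Slice(sites, func(i, j int) bool {
		if sites[i].File != sites[j].File {
			return sites[i].File < sites[j].File
		}
		return sites[i].Line < sites[j].Line
	})
	return sites, nil
}

func main() {
	sites, err := FindCallSites("demo.go", sample, "legacy", "Lookup")
	if err != nil {
		panic(err)
	}
	for _, s := range sites {
		fmt.Printf("%s:%d\n", s.File, s.Line)
	}
}
```

Because the manifest is plain, sorted data, it can be committed, diffed between runs, and reviewed before any agent is asked to edit the listed locations.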
Why sequencing and invariants break agentic refactoring
Large refactors fail when the order of operations matters more than the code itself. The article’s examples show that backfilling schema fields before changing write paths, or treating shared tables as independently owned, can create silent data loss and deployment conflict. These are not syntax errors. They are sequencing failures that appear only when the migration interacts with live state. In identity terms, the problem is not whether the actor can propose a change, but whether it understands constraints that must remain true across every step of execution.
Practical implication: make ordering constraints and invariants explicit before any agent touches stateful systems.
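One way to make an ordering constraint explicit is to declare it as data and validate any proposed plan against it before execution. The following is a minimal sketch under stated assumptions: the step names `switch-write-path` and `backfill-new-column` are hypothetical labels for the write-path and backfill example above, and step names are assumed to be unique within a plan.

```go
package main

import "fmt"

// Constraint says step Before must appear earlier in the plan than step After.
type Constraint struct {
	Before, After string
}

// ValidateOrder returns one message per violated constraint.
// An empty result means the plan respects every declared ordering.
// Assumption: each step name appears at most once in the plan.
func ValidateOrder(plan []string, constraints []Constraint) []string {
	index := make(map[string]int, len(plan))
	for i, step := range plan {
		index[step] = i
	}
	var violations []string
	for _, c := range constraints {
		b, okB := index[c.Before]
		a, okA := index[c.After]
		switch {
		case !okA:
			// The dependent step is not in this plan, so there is nothing to order.
			continue
		case !okB:
			violations = append(violations,
				fmt.Sprintf("%q requires %q, which is missing from the plan", c.After, c.Before))
		case b > a:
			violations = append(violations,
				fmt.Sprintf("%q must run before %q", c.Before, c.After))
		}
	}
	return violations
}

func main() {
	// Hypothetical rule: the write path must change before the backfill runs.
	constraints := []Constraint{
		{Before: "switch-write-path", After: "backfill-new-column"},
	}
	proposed := []string{"backfill-new-column", "switch-write-path"}

	for _, v := range ValidateOrder(proposed, constraints) {
		fmt.Println("BLOCKED:", v)
	}
}
```

The design choice here is that the constraint list is written and reviewed by humans, while the agent only proposes the plan; a plan that violates a constraint is rejected before it reaches a stateful system.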
What speculative completion means in agentic workflows
The article describes a recurring pattern called speculation, where the agent fills context gaps with plausible but unverified assumptions. That behaviour is useful in brainstorming, but dangerous in production migration work because a single wrong inference can propagate through an entire change set. This is the same failure mode security teams see when runtime systems operate on incomplete policy context. Once the system assumes the missing fact is true, the error is no longer local. It becomes embedded in the change plan, the implementation, and the rollback scope.
Practical implication: force explicit escalation whenever the agent encounters ambiguity instead of letting it infer missing facts.
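A simple way to picture forced escalation is a lookup that can only return validated facts and otherwise returns a distinct error the workflow must treat as a stop signal. This is an illustrative sketch, not a description of any tool in the article; the `Facts` type, the `ErrNeedsHuman` sentinel, and the keys `table.owner` and `tenant.id_column` are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNeedsHuman signals that the workflow must stop and escalate to a person.
var ErrNeedsHuman = errors.New("unresolved fact: escalate to human review")

// Facts holds only values that a human or a deterministic analyzer has validated.
type Facts map[string]string

// Resolve returns a validated fact, or an error wrapping ErrNeedsHuman.
// It never substitutes a default or a guess for a missing value.
func (f Facts) Resolve(key string) (string, error) {
	v, ok := f[key]
	if !ok || v == "" {
		return "", fmt.Errorf("%w: %q", ErrNeedsHuman, key)
	}
	return v, nil
}

// NeedsHuman reports whether err means the session should halt for review.
func NeedsHuman(err error) bool {
	return errors.Is(err, ErrNeedsHuman)
}

func main() {
	// Hypothetical validated context for one migration step.
	facts := Facts{"table.owner": "billing"}

	for _, key := range []string{"table.owner", "tenant.id_column"} {
		v, err := facts.Resolve(key)
		if NeedsHuman(err) {
			fmt.Println("STOP:", err)
			return
		}
		fmt.Println(key, "=", v)
	}
}
```

The point of the sentinel error is that ambiguity becomes a visible, typed outcome rather than something the agent can paper over with a plausible value.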
Breaches seen in the wild
- Moltbook AI agent keys breach — Moltbook breach exposed 1.5M AI agent keys.
- AI LLM hijack breach — attackers used stolen AWS access keys to hijack Anthropic LLM models on Bedrock.
Read our 52 NHI Breaches Analysis report for a comprehensive view of breaches impacting Non-Human Identities including AI Agents.
NHI Mgmt Group analysis
Agentic refactoring turns code migration into an identity governance problem. The article shows that the real control boundary is no longer just source control or CI/CD. It is the point at which a machine actor is allowed to interpret dependencies, sequence changes, and decide whether a pattern is safe enough to generalise. For NHI and emerging autonomous governance teams, that means the actor’s permissions are only one part of the risk picture. The harder issue is whether the workflow itself was designed to tolerate machine judgment without creating hidden state changes.
Deterministic tooling is the governance pattern that survived first contact with agentic execution. The strongest result in the article came from using agents to build reproducible analyzers rather than letting them free-run through the codebase. That aligns with OWASP NHI thinking: the safest machine actor is the one constrained by artefacts, not by hope. For practitioners, the implication is clear. When an agent can only succeed by improvising, the workflow is already under-specified.
Least privilege for autonomous execution is not a provisioning question alone. Access review and approval models were designed for actors whose intent is stable long enough to be observed before action completes. That assumption fails when the agent can plan, execute, and continue chaining work across a session without a human pacing each step. The implication is not simply to add more approval points. It is to recognise that the control model itself was built for slower, more legible actors.
Incomplete specifications create implicit governance, and implicit governance is the real failure mode here. When the article says agents will fill in the gaps, it is describing a control boundary that has already been crossed. In NHI terms, the system is no longer governed by declared policy alone. It is governed by whatever the agent inferred in the moment. Practitioners should treat that as a design defect in the workflow, not as an acceptable trade-off.
Named concept: specification drift. This post shows that production agentic work fails when the written task and the actual execution path diverge under ambiguity, especially in stateful systems. That drift matters because it creates hidden scope expansion inside a supposedly bounded automation. Teams should read it as a warning that machine execution without complete constraints becomes a governance layer of its own.
From our research:
- 92% agree governing AI agents is critical to enterprise security, yet only 44% have implemented any policies to do so, according to AI Agents: The New Attack Surface report.
- Only 52% of companies can track and audit the data their AI agents access, leaving 48% with a complete blind spot for compliance and breach investigation.
- For a broader view of autonomous risk, see OWASP NHI Top 10, which frames the control failures that emerge when agents can act beyond fixed workflows.
What this signals
Specification drift: when an agent’s written task diverges from its execution path under ambiguity, governance weakens faster than code quality improves. Teams should expect more agent-assisted migrations to succeed only when the workflow is designed to fail loudly, not to improvise quietly.
NHIMG research found that 80% of organisations report their AI agents have already acted beyond intended scope. The practical signal is that bounded autonomy is now the baseline requirement, not a mark of advanced maturity. That shifts the programme conversation from adoption to containment.
For practitioners
- Define executable boundaries before enabling agentic work: Require a written scope that names allowed files, allowed operations, rollback conditions, and explicit stop points before any agent touches production code.
- Convert migration tasks into deterministic manifests: Use analyzers, manifests, and templated change sets so the agent edits a stable artefact instead of reasoning live over ambiguous system state.
- Treat sequencing as a hard control, not a suggestion: Block any agent workflow that can backfill schema, rewrite write paths, or alter ownership boundaries without enforced ordering and validation gates.
- Force human escalation on ambiguity: Any unresolved assumption, inferred identifier, or uncertain dependency should stop the session and route to human review rather than being guessed by the model.
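
The first practice above, a written scope naming allowed files and operations, could be checked mechanically before each proposed change is applied. The sketch below is one possible shape, assuming a hypothetical `Scope` type and example paths; it covers only files and operations, and leaves rollback conditions and stop points out. It relies on the standard library's `path.Match`, where `*` does not cross a `/` separator.

```go
package main

import (
	"fmt"
	"path"
)

// Scope is a human-approved boundary for one agent session.
type Scope struct {
	AllowedGlobs []string        // file patterns the agent may touch
	AllowedOps   map[string]bool // operations the agent may perform
}

// Change is a single action the agent proposes.
type Change struct {
	Op   string
	File string
}

// Check returns nil only if both the operation and the file are in scope.
func (s Scope) Check(c Change) error {
	if !s.AllowedOps[c.Op] {
		return fmt.Errorf("operation %q is outside the approved scope", c.Op)
	}
	file := path.Clean(c.File) // normalise "./" and "../" segments first
	for _, g := range s.AllowedGlobs {
		ok, err := path.Match(g, file)
		if err != nil {
			return fmt.Errorf("bad pattern %q: %w", g, err)
		}
		if ok {
			return nil
		}
	}
	return fmt.Errorf("file %q is outside the approved scope", c.File)
}

func main() {
	// Hypothetical scope: edits only, and only in one package directory.
	scope := Scope{
		AllowedGlobs: []string{"internal/billing/*.go"},
		AllowedOps:   map[string]bool{"edit": true},
	}
	proposed := []Change{
		{Op: "edit", File: "internal/billing/store.go"},
		{Op: "edit", File: "internal/auth/token.go"},
		{Op: "delete", File: "internal/billing/store.go"},
	}
	for _, c := range proposed {
		if err := scope.Check(c); err != nil {
			fmt.Println("REJECTED:", err)
			continue
		}
		fmt.Println("allowed:", c.Op, c.File)
	}
}
```

Anything the check rejects is a candidate for human review rather than a retry, which keeps the boundary declared in advance instead of inferred during the session.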
Key takeaways
- Agentic refactoring is valuable, but only when machine actors are constrained by explicit artefacts, sequencing rules, and escalation paths.
- The evidence in the article shows that the hard problems are dependency order, invariant preservation, and ambiguity handling, not code generation speed.
- For identity and governance teams, the lesson is that autonomous execution fails when the workflow assumes humans will still be present to interpret and correct every missing detail.
Standards & Framework Alignment
This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.
OWASP Agentic AI Top 10 and OWASP Non-Human Identity Top 10 address the attack and risk surface, while NIST CSF 2.0 sets the governance and control requirements practitioners need to meet.
| Framework | Control / Reference | Relevance |
|---|---|---|
| OWASP Agentic AI Top 10 | A2 | Agentic workflows fail here when context gaps trigger unsafe autonomous completion. |
| OWASP Non-Human Identity Top 10 | NHI-03 | The article highlights the need to govern machine actor permissions and lifecycle. |
| NIST CSF 2.0 | PR.AA-05 | The article centers on access and execution boundaries for machine actors. |
Constrain agent goals, tools, and stop conditions before allowing production code changes.
Key terms
- Agentic Refactoring: The use of AI agents to analyse, plan, and execute code changes across a live system. In security terms, the risk is not just code generation, but machine-led sequencing, dependency inference, and change execution inside production boundaries.
- Specification Drift: The gap that appears when the written task and the executed task diverge because the system left too much ambiguity for the agent to resolve on its own. In production governance, drift becomes a control failure when the agent substitutes assumptions for validated constraints.
- Deterministic Artefact: A stable output such as a manifest, analyzer result, or plan file that can be reviewed, versioned, and repeatedly validated. For agentic systems, deterministic artefacts matter because they replace model interpretation with something the organisation can govern.
Deepen your knowledge
AI agentic refactoring and bounded execution are core topics in our NHI Foundation Level course, the industry's only accredited NHI security programme. If you are defining controls for machine actors in production workflows, it is worth exploring.
This post draws on content published by 1Password: AI agents in a production Go monolith. Read the original.
Published by the NHIMG editorial team on 2026-04-20.
NHI Mgmt Group — the independent authority on Non-Human Identity, IAM, and Agentic AI security. nhimg.org