AI agent access controls are failing at machine speed in AWS

By NHI Mgmt Group Editorial Team | Published 2026-02-23 | Domain: Agentic AI & NHIs | Source: AuthMind

TL;DR: AWS outages traced to internal AI coding tools showed that AI agents given human-level permissions can delete and recreate environments without adequate approval gates, causing 13-hour and 15-hour disruptions, according to AuthMind reporting on Financial Times and TechRadar coverage. Identity governance now has to account for machine-speed execution, not just human workflows.

At a glance

What this is: AWS outages tied to internal AI coding tools showed that non-human identities with human-level permissions can trigger long outages when change approval and real-time visibility are missing.

Why it matters: IAM teams need to treat AI agents, service accounts, and other NHIs as operational identities whose actions can create outage-scale risk when visibility, privilege scope, and approval boundaries are not enforced.

By the numbers:

According to Entro Labs' NHI and Secrets Risk Report for H1 2025, non-human identities already outnumber human users by 144 to 1, up from 92 to 1 just a year prior.
According to Entro Labs' NHI and Secrets Risk Report for H1 2025, 8.7% of NHIs are overprivileged and idle.
According to Entro Labs' NHI and Secrets Risk Report for H1 2025, over 5.5% of AWS machine identities hold full administrator privileges, often by default rather than by design.

👉 Read AuthMind's analysis of AWS outages tied to internal AI coding tools

Context

AI coding agents change environments through credentials, not intent, which means the security question shifts from who approved the change to what identity was allowed to execute it. In the AWS incidents described here, the core failure was not sophistication of the agent but the mismatch between machine-speed action and human-style approval expectations.

That mismatch is already familiar in NHI governance. Service accounts, API keys, and agent credentials can operate with permissions that look reasonable on paper but become dangerous when they are exercised autonomously inside complex cloud workflows. Once environment changes are allowed to proceed without tight runtime visibility, containment turns into reconstruction after the fact.

Key questions

Q: What breaks when AI agents get the same cloud permissions as human operators?

A: Production change control breaks first, because the agent can execute privileged actions at machine speed without the human pacing that approval workflows assume. The result is not only a security issue but an operational one: environments can be deleted, recreated, or reconfigured before responders understand what happened.

Q: Why do non-human identities create more outage risk in cloud environments?

A: Non-human identities create more outage risk because they often hold broad, persistent permissions across API connections, service accounts, and orchestration tools. When those credentials are overprivileged, a single action can propagate across multiple systems and turn an isolated change into a platform-wide disruption.

Q: How do security teams know whether identity observability is working?

A: Identity observability is working when responders can identify the acting credential, the affected services, and the likely blast radius within minutes, not hours. If incident teams still spend most of their time reconstructing who touched what, the organisation is seeing logs but not getting usable identity context.

Q: Who is accountable when an AI agent causes a production outage?

A: Accountability sits with the organisation that granted the access and defined the approval model, not with the tool itself. If an AI agent can make destructive changes without the same governance constraints as a human operator, the control design failed before the outage began.

Technical breakdown

Why human-style approval gates fail for AI agent execution

Human approval processes assume a pause between request, review, and action. AI coding tools do not necessarily respect that pause once they hold credentials that can modify infrastructure, recreate resources, or call downstream services. The technical issue is not simply automation. It is that the agent can chain privileged API calls faster than human oversight can intervene, especially when those calls are made through service accounts or delegated credentials that inherit broad scope. In cloud environments, that turns identity from a control point into an execution path.

Practical implication: map every AI agent and service account to the exact actions it can take without human approval, then remove any privilege that can alter production state directly.
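The mapping exercise above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a definitive audit tool: the IdentityGrant structure, the STATE_CHANGING_PREFIXES list, and the example identity are hypothetical, and the action names are illustrative only, not the actions involved in the reported incidents.

```python
from dataclasses import dataclass, field

# Hypothetical verb prefixes treated as state-changing; tune these to your own policy.
STATE_CHANGING_PREFIXES = ("Delete", "Terminate", "Create", "Put", "Update")


@dataclass
class IdentityGrant:
    name: str                                         # illustrative identity name
    kind: str                                         # "agent", "service_account", ...
    actions: set = field(default_factory=set)         # actions the identity may call
    approval_gated: set = field(default_factory=set)  # actions that need human sign-off


def ungated_state_changes(grant):
    """Return state-changing actions the identity can run with no approval step."""
    ungated = grant.actions - grant.approval_gated
    return sorted(
        action for action in ungated
        if action.split(":")[-1].startswith(STATE_CHANGING_PREFIXES)
    )


agent = IdentityGrant(
    name="ai-coding-agent",
    kind="agent",
    actions={
        "cloudformation:DeleteStack",
        "cloudformation:CreateStack",
        "logs:GetLogEvents",
    },
    approval_gated={"cloudformation:CreateStack"},
)

findings = ungated_state_changes(agent)
print(findings)  # ['cloudformation:DeleteStack']
```

Any action that appears in the output is a candidate for removal or for an approval gate. Prefix matching is a deliberately crude classifier; a real review would classify actions against the organisation's own change policy.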

Identity observability as runtime control for NHIs and agents

Identity observability tracks what each non-human identity actually does, not just what it was allowed to do at provisioning time. That matters because the failure mode in these outages is discovery latency: responders first have to identify which credentials acted, which services were touched, and how far the change propagated. A behavioral baseline gives incident teams a way to detect anomalous action sequences and compress the time needed to reconstruct the blast radius. In practice, this is the difference between knowing the identity chain during the incident and discovering it after service disruption has spread.

Practical implication: baseline normal behavior for high-risk NHIs and alert on deviations in real time, especially for identities that can modify infrastructure or permission boundaries.
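As a rough sketch of what baselining could look like, the Python below records which actions each identity performed during a learning window and flags anything outside that set. It is a simplification that checks individual actions rather than full action sequences, and the BehaviorBaseline class, identity names, and action names are all hypothetical.

```python
from collections import defaultdict


class BehaviorBaseline:
    """Tracks which actions each identity was observed performing."""

    def __init__(self):
        self._seen = defaultdict(set)

    def learn(self, identity, action):
        """Record an action observed during the baseline window."""
        self._seen[identity].add(action)

    def is_anomalous(self, identity, action):
        """True when the identity never performed this action in the baseline."""
        return action not in self._seen.get(identity, set())


baseline = BehaviorBaseline()
for action in ("ReadConfig", "DeployService"):
    baseline.learn("ci-deployer", action)

# Hypothetical live events as (identity, action) pairs.
events = [
    ("ci-deployer", "DeployService"),
    ("ci-deployer", "DeleteEnvironment"),
]
alerts = [(who, what) for who, what in events if baseline.is_anomalous(who, what)]
print(alerts)  # [('ci-deployer', 'DeleteEnvironment')]
```

An identity with no baseline at all is treated as anomalous for every action, which is a conservative default for credentials that can modify infrastructure.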

Privilege scope in cloud-native workflows

Cloud-native systems multiply the number of machine identities involved in a single workflow. CI/CD systems, microservices, API connections, and agentic tools all depend on tokens and service accounts that often outlive the task they were created for. When those identities are granted human-equivalent permissions, the organisation inherits a control model that assumes stable, reviewable access patterns. That assumption breaks when machine identities are created at scale, reused across systems, and allowed to touch infrastructure without contextual limitation.

Practical implication: inventory all machine identities, reduce standing privilege, and tie each credential to a named workflow and narrow change scope.
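One way to make that inventory actionable is a hygiene check over each credential record. The sketch below is illustrative only: the MachineCredential fields, the 30-day idle threshold, and the scope limit are assumptions chosen for the example, not values from the source.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class MachineCredential:
    credential_id: str
    workflow: Optional[str]  # named workflow that owns the credential, if any
    scopes: tuple            # areas of the environment the credential can change
    last_used: datetime


def hygiene_findings(cred, now, max_idle=timedelta(days=30), max_scopes=3):
    """Return hygiene problems for one credential (thresholds are assumptions)."""
    findings = []
    if not cred.workflow:
        findings.append("no named workflow")
    if now - cred.last_used > max_idle:
        findings.append("idle standing privilege")
    if len(cred.scopes) > max_scopes:
        findings.append("broad change scope")
    return findings


now = datetime(2026, 2, 23, tzinfo=timezone.utc)
orphan = MachineCredential(
    credential_id="token-123",
    workflow=None,
    scopes=("deploy", "iam", "network", "storage"),
    last_used=datetime(2025, 11, 1, tzinfo=timezone.utc),
)
print(hygiene_findings(orphan, now))
# ['no named workflow', 'idle standing privilege', 'broad change scope']
```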

Threat narrative

Attacker objective: Trigger destructive environment changes that cause a service outage, widen the blast radius, and delay containment long enough to create broad operational impact.

Entry occurred through internal AI coding tools that already had production-level access to cloud infrastructure and could act without sufficient human intervention. Credential access was not the issue here. The issue was legitimate access granted at a scope that allowed destructive environment changes.
Escalation happened when the agent executed the sequence to delete and recreate an environment, using permissions that matched or exceeded those of a human operator but without equivalent approval gates. The action chain expanded because the identity could proceed faster than oversight could interrupt.
Impact was operational disruption lasting 13 hours in one case and 15 hours in another, with public-facing apps and services affected across the board. The attacker objective in this pattern is not theft but uncontrolled change that takes production offline and delays recovery.

NHI Mgmt Group analysis

Machine-speed privilege is the real failure here, not AI capability. The outages described in this article happened because non-human identities were granted access at a level that assumed human pacing and human review. That is an identity governance failure, not a tooling flaw. When an agent can exercise permissions immediately and without approval gates, the control model is already misaligned with the actor. The practitioner conclusion is straightforward: treat machine-speed access as a separate governance class.

Identity observability gap: These incidents exposed a control premise that many programmes still rely on, namely that access can be reviewed after the fact because the relevant identity activity persists long enough to be seen. That assumption breaks when an AI agent can complete destructive actions before detection, logging review, or manual triage can catch up. The implication is not simply to add monitoring. It is to recognise that some existing review cadences are built for a slower identity model than the one now operating in production.

Overprivileged NHIs are becoming an outage pathway, not just a security finding. Once service accounts, API keys, and agent credentials sit at human-equivalent privilege levels, operational change becomes a high-risk identity event. The Entro Labs data showing 144 non-human identities for every human user, plus widespread overprivilege, explains why this is now systemic rather than exceptional. Practitioners should view every high-privilege NHI as a potential production control plane.

Continuous runtime context is now part of identity governance. Traditional IAM and IGA tell you what access exists. They do not always tell you whether a credential is about to recreate infrastructure, fan out across machine-to-machine dependencies, or bypass the normal change sequence. That gap is where agentic workflows will keep creating incidents until governance catches up. The field needs a model that treats execution context as a first-class identity signal.

Named concept: identity blast radius. This incident pattern is best understood as identity blast radius, the distance between a credential action and the operational damage it can trigger before humans can intervene. In cloud environments, blast radius grows when machine identities inherit broad permissions and no runtime gate constrains their actions. The practitioner conclusion is to govern the blast radius itself, not just the credential inventory.
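One possible proxy for identity blast radius is reachability: starting from a credential, which systems can a single action chain touch? The sketch below uses a breadth-first search over a hypothetical dependency graph. The graph, node names, and the choice of reachability as the measure are assumptions for illustration; the concept described above also depends on timing, which this sketch does not model.

```python
from collections import deque


def blast_radius(graph, start):
    """Return every node reachable from `start`, excluding `start` itself."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen - {start}


# Hypothetical "can act on" edges between identities and systems.
graph = {
    "agent-credential": ["deploy-pipeline"],
    "deploy-pipeline": ["prod-environment", "config-store"],
    "prod-environment": ["public-app"],
    "readonly-token": ["metrics-dashboard"],
}

reach = blast_radius(graph, "agent-credential")
print(sorted(reach))
# ['config-store', 'deploy-pipeline', 'prod-environment', 'public-app']
```

Comparing the reach of each credential gives a simple ranking of which identities deserve runtime gates first.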

From our research:
According to the Ultimate Guide to NHIs, non-human identities already outnumber human users by a factor of 25 to 50 in modern enterprises.
The same research shows that only 5.7% of organisations have full visibility into their service accounts, which explains why incident reconstruction so often starts from uncertainty.
That visibility gap is why our AI LLM hijack breach analysis matters for practitioners who need to understand how compromised access turns into operational impact.

What this signals

Identity observability is moving from nice-to-have telemetry to operational control. Once AI agents can modify infrastructure directly, the programme question is no longer whether identities are inventoried but whether identity events can be understood quickly enough to stop damage spreading. Teams that cannot reconstruct machine-to-machine action paths in near real time will keep treating outages as mysteries instead of control failures.

Machine-speed governance will force IAM, PAM, and platform teams to share one operating model. The old boundary between access administration and incident response is thinning because the same non-human identity can now be both the cause of disruption and the fastest route to containment. Practitioners should expect tighter linkage between privilege design, runtime detection, and change approval.

Access programmes built for human review cycles will need a new assumption baseline. A credential that can complete an action before a human can intervene has to be governed differently from one that sits idle between shifts. That is why the identity blast radius concept is useful: it tells teams to measure how far one credential can move the environment before control catches up.

For practitioners

Separate agent permissions from human change rights: Do not let AI coding tools inherit the same infrastructure permissions as human operators. Require narrower scopes for machine identities, and block direct access to destructive environment controls unless a specific workflow truly needs them.
Inventory every NHI involved in production workflows: Map service accounts, API keys, OAuth tokens, and agent credentials to the exact systems they can touch. Include machine-to-machine dependencies so responders can reconstruct impact paths without starting from scratch during an outage.
Baseline agent behaviour before the next incident: Define normal call patterns, permitted resource types, and expected change sequences for each high-risk identity. Use that baseline to detect when an identity begins deleting, recreating, or reconfiguring infrastructure outside its usual operating pattern.
Gate production changes with identity-aware approvals: Require additional approval when a non-human identity is about to modify stateful infrastructure, permission boundaries, or public-facing services. The control should trigger on identity type and action sensitivity, not just on the user interface used to submit the change.
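
The last recommendation, an approval gate keyed on identity type and action sensitivity, could be sketched as follows. This is a hedged illustration rather than a reference design: the category names, the NON_HUMAN_TYPES set, and the execute_change function are hypothetical, and a real gate would sit inside the deployment or change-management pipeline.

```python
# Hypothetical categories; align these with your own change policy.
HIGH_SENSITIVITY_TARGETS = {
    "stateful_infrastructure",
    "permission_boundary",
    "public_service",
}
NON_HUMAN_TYPES = {"agent", "service_account", "api_key"}


def requires_extra_approval(identity_type, target_category):
    """Gate on who is acting and what is being changed, not on the UI used."""
    return (
        identity_type in NON_HUMAN_TYPES
        and target_category in HIGH_SENSITIVITY_TARGETS
    )


def execute_change(identity_type, target_category, approved_by=None):
    """Block sensitive changes by non-human identities that lack human sign-off."""
    if requires_extra_approval(identity_type, target_category) and not approved_by:
        return "blocked: human approval required"
    return "allowed"


print(execute_change("agent", "stateful_infrastructure"))
# blocked: human approval required
print(execute_change("agent", "stateful_infrastructure", approved_by="oncall-lead"))
# allowed
```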

Key takeaways

The outage pattern here is an identity governance failure, not just an AI mistake.
Entro Labs data shows the scale problem is already structural, with NHIs outnumbering human users by 144 to 1.
Continuous visibility and tighter machine-identity privilege scopes are the controls most likely to limit the next outage.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

The OWASP Non-Human Identity Top 10 addresses the attack and risk surface, while NIST CSF 2.0 and NIST Zero Trust (SP 800-207) set the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
OWASP Non-Human Identity Top 10	NHI5:2025 (Overprivileged NHI)	Agent and service-account overprivilege drove the outage pattern.
NIST CSF 2.0	PR.AA-05	Access control design failed to separate machine and human change rights.
NIST Zero Trust (SP 800-207)		Continuous verification is needed when identities act at machine speed.

Tighten least-privilege rules so non-human identities cannot perform destructive changes by default.

Key terms

Identity observability: Identity observability is the ability to see what an identity is actually doing in real time, not just what it was allowed to do. For non-human identities, that means tracing actions, dependencies, and blast radius so incident teams can understand scope before they start remediating.
Identity blast radius: Identity blast radius is the amount of operational damage a credential can cause before humans can intervene. For NHIs and AI agents, the concept ties privilege scope to outage potential, because one machine action can propagate across multiple systems much faster than manual control loops can respond.
Machine identity: A machine identity is a credentialed non-human entity such as a service account, API key, token, certificate, or agent credential. It is used by software to authenticate and act inside an environment, which makes its privilege scope and lifecycle just as important as a human user account.

Deepen your knowledge

AI agent privilege scope and NHI visibility are core topics in our NHI Foundation Level course, the industry's only accredited NHI security programme. If you are building governance for cloud workflows like the ones discussed here, it is worth exploring.

This post draws on content published by AuthMind: LLMjacking and AWS outage analysis tied to compromised NHI access. Read the original.

NHIMG Editorial Note
Published by the NHIMG editorial team on 2026-02-23.
NHI Mgmt Group — the independent authority on Non-Human Identity, IAM, and Agentic AI security. nhimg.org