Ephemeral Infrastructure Identity Inheritance
TL;DR
- This article covers the risks that arise when short-lived cloud resources inherit permissions from their parent environments. It explores how workload identities and machine identities create security gaps during rapid scaling. You will learn strategies for managing non-human identity lifecycles and why traditional IAM fails in serverless and containerized setups.
The Messy Reality of Temporary Identities
Ever wonder why a tiny container running a simple cron job suddenly has the keys to your entire production database? It's usually because we’re lazily letting temporary workloads inherit whatever permissions their "parents" have, and honestly, it’s a security nightmare.
In the old days, you gave a server an identity and it kept that identity for years. Now, Lambda functions and Kubernetes pods pop up and disappear in seconds. The mess starts when these short-lived workloads simply "inherit" the IAM role of the node or the service account they're running under.
- Container and Lambda Rights: In many cloud setups, if you don't explicitly define a narrow identity, the workload just picks up the broad permissions of its host. In a healthcare app, a simple logging container might inherit access to PII just because the underlying node needs it.
- Static Secrets in a Dynamic World: We’re still seeing devs hardcode API keys into container images. According to the CyberArk 2024 Machine Identity Security Report, machine identities now outnumber humans by a staggering 41 to 1, yet we’re still managing them like it's 2010.
- The Least Privilege Failure: In finance or retail, scaling fast usually wins over security. We give "admin-lite" roles to automated scripts because it’s easier than mapping out exactly what a 10-second process needs.
I've seen so many teams struggle with this because cloud providers make "default" settings so tempting. But when a dev tool in your DevOps pipeline inherits the ability to wipe an S3 bucket, that's not a feature; it's a massive liability.
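To make "mapping out exactly what a process needs" concrete, here is a minimal sketch of a least-privilege IAM policy for a hypothetical log-writing job, built as a Python dict. The bucket name `example-app-logs` and the `cron-job/` prefix are made up for illustration; the point is that the job gets `s3:PutObject` on one prefix instead of inheriting something like `s3:*`, which would include destructive actions such as `s3:DeleteObject`.

```python
import json


def log_writer_policy(bucket: str, prefix: str) -> dict:
    """Least-privilege policy: write objects under one prefix, nothing else.

    Compare with a broad grant such as s3:* on the whole bucket, which also
    allows s3:DeleteObject and s3:DeleteBucket.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "WriteLogsOnly",
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix.strip('/')}/*",
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical bucket and prefix names
    print(json.dumps(log_writer_policy("example-app-logs", "cron-job/"), indent=2))
```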
Next, we’ll look at how this inheritance actually breaks down at the technical level.
Security Risks in the Inheritance Chain
Think about a Russian nesting doll, but instead of cute painted wood, every smaller doll is a potential back door into your cloud environment. That's basically the inheritance chain in ephemeral infrastructure, where a single "parent" identity passes its DNA down to dozens of tiny, short-lived workloads that probably shouldn't have it.
The biggest headache here is permission bloat. When you spin up a k8s pod without giving it its own cloud identity, its calls to the cloud API typically fall back to the IAM role of the node it’s sitting on. If that node has "S3 Full Access" because it needs to write logs, every single container on that node now has it too.
It gets worse with what I call "ghost identities." A workload finishes its job and disappears in seconds, but the IAM session or the service account token might stay valid for much longer. If an attacker grabs that token, they’re acting as a legitimate service that technically doesn't even exist anymore.
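One way to put a number on a "ghost identity" is to compare the token's `exp` claim with the moment its workload terminated. A minimal sketch, assuming the credential is a JWT (as Kubernetes service account tokens and most OIDC tokens are) and that you record the termination time yourself; the function names are hypothetical.

```python
import base64
import json
import time
from typing import Optional


def jwt_claims(token: str) -> dict:
    """Decode a JWT's payload WITHOUT verifying its signature.

    Fine for inspecting your own tokens; never use this to make trust decisions.
    """
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # JWTs strip base64url padding
    return json.loads(base64.urlsafe_b64decode(payload))


def ghost_window_seconds(token: str, workload_ended_at: Optional[float] = None) -> int:
    """How long the token stays usable after its workload terminated (0 if none)."""
    ended = time.time() if workload_ended_at is None else workload_ended_at
    return max(0, int(jwt_claims(token)["exp"] - ended))
```

A result of several thousand seconds for a job that ran for thirty means the credential outlived its workload by close to an hour.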
A 2023 report by Wiz revealed that 90% of cloud identities are using less than 5% of the permissions they’ve actually been granted, creating a massive surface for lateral movement.
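The same kind of gap can be measured per identity by comparing what was granted with what was actually exercised. This sketch assumes you already have both sets of action names, for example from the attached policies and from access logs such as CloudTrail; the helper names and the sample data are hypothetical.

```python
def unused_permissions(granted: set[str], used: set[str]) -> set[str]:
    """Actions an identity holds but has never exercised: pure blast radius."""
    return granted - used


def usage_ratio(granted: set[str], used: set[str]) -> float:
    """Fraction of granted actions that were actually used (0.0 if none granted)."""
    if not granted:
        return 0.0
    return len(granted & used) / len(granted)


# Hypothetical data for a log-shipping container
granted = {"s3:GetObject", "s3:PutObject", "s3:DeleteObject", "s3:DeleteBucket", "dynamodb:Scan"}
used = {"s3:PutObject"}

print(sorted(unused_permissions(granted, used)))
print(f"{usage_ratio(granted, used):.0%} of granted actions used")
```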
To understand how this happens, you gotta look at the technical handshake. In AWS, for example, a container asks the Instance Metadata Service (IMDS) for credentials. If you haven't set up fine-grained roles for pods (like IRSA, IAM Roles for Service Accounts), the IMDS just hands the node's broad credentials to any process that asks from inside that instance. It’s a similar story in Kubernetes, where the default service account token is automounted into every pod unless you explicitly opt out.
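Here is roughly what that handshake looks like from inside any process on the instance. This is a standard-library sketch of the IMDSv2 flow (the session-token variant of the metadata service); the endpoint paths and headers are the documented EC2 ones, while the function names are hypothetical. Note that nothing in the exchange identifies which pod is asking.

```python
import urllib.request

IMDS = "http://169.254.169.254"


def imds_token_request(ttl_seconds: int = 60) -> urllib.request.Request:
    # Step 1: a PUT returns a session token used for subsequent reads.
    return urllib.request.Request(
        f"{IMDS}/latest/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )


def imds_creds_request(session_token: str, role_name: str = "") -> urllib.request.Request:
    # Step 2: with an empty role_name this lists the instance role;
    # appending the role name returns that role's temporary credentials.
    return urllib.request.Request(
        f"{IMDS}/latest/meta-data/iam/security-credentials/{role_name}",
        headers={"X-aws-ec2-metadata-token": session_token},
    )


def fetch_node_credentials(timeout: float = 2.0) -> str:
    """Only works on an EC2 instance. Whoever runs this gets the NODE's role."""
    with urllib.request.urlopen(imds_token_request(), timeout=timeout) as r:
        token = r.read().decode()
    with urllib.request.urlopen(imds_creds_request(token), timeout=timeout) as r:
        role = r.read().decode().strip().splitlines()[0]
    with urllib.request.urlopen(imds_creds_request(token, role), timeout=timeout) as r:
        # JSON including AccessKeyId, SecretAccessKey, Token and Expiration
        return r.read().decode()
```

Requiring IMDSv2 with a metadata response hop limit of 1 is a common mitigation, since the token response then cannot reach containers behind an extra network hop; it narrows the exposure but is not a substitute for per-pod roles.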
The issue isn't just that permissions are too wide, it's that we’re not treating these machine identities with the same lifecycle rigor as human ones. We wouldn't let a fired employee keep their badge for an hour, so why do we let an expired container keep its cloud permissions?
Next, we're going to look at the high-level governance frameworks that help manage these non-human identities.
Architecting for Better NHI Governance
We can't just keep crossing our fingers and hoping the default cloud settings will save us. If you're still letting every pod in your cluster run with a broad node-level identity, you aren't just "moving fast"—you're basically leaving the back door unlocked while you go on vacation.
To actually fix this, we need to stop looking at identities as static objects and start treating them like the ephemeral resources they are. That is where the NHIMG framework comes in.
I've been digging into the work over at the Non-Human Identity Management Group, and honestly, it’s a breath of fresh air for anyone tired of the "just use a secret manager" advice. They focus on the full lifecycle of a machine identity, which is exactly what’s missing in most devops pipelines.
- Lifecycle over Storage: Most teams obsess over where to store a password, but nhimg.org pushes for issuance and revocation logic. If a workload only lives for 30 seconds, its identity shouldn't live for an hour.
- Identity Federation: Instead of passing around long-lived API keys, we should be using workload identity federation. This lets your cloud provider trust your k8s cluster or GitHub Actions runner directly, exchanging a short-lived OIDC token for cloud permissions.
- Attestation: You shouldn't trust a workload just because it has a token. You trust it because it can prove it's the specific build, from the specific repo, running on the specific hardware you expect.
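Short of a full attestation system, a simple illustration of "prove which build you are" is checking provenance claims inside an OIDC token whose signature has already been verified. The claim names (`iss`, `aud`, `sub`, `exp`) and the `repo:<owner>/<repo>:ref:<ref>` subject format follow GitHub Actions' OIDC tokens; the repository name and the function are hypothetical, and signature verification is assumed to have happened first.

```python
import time
from typing import Optional

# Hypothetical expectations for one deploy workflow
EXPECTED_CLAIMS = {
    "iss": "https://token.actions.githubusercontent.com",
    "sub": "repo:example-org/payments-service:ref:refs/heads/main",
}
EXPECTED_AUDIENCE = "sts.amazonaws.com"


def provenance_ok(claims: dict, now: Optional[float] = None) -> bool:
    """Accept a workload only if its (already verified) token says where it came from."""
    now = time.time() if now is None else now
    if claims.get("exp", 0) <= now:
        return False  # an expired token proves nothing
    aud = claims.get("aud")
    audiences = aud if isinstance(aud, list) else [aud]  # JWT aud may be a string or list
    if EXPECTED_AUDIENCE not in audiences:
        return False
    return all(claims.get(k) == v for k, v in EXPECTED_CLAIMS.items())
```

A token from the same repository but a feature branch fails the `sub` check, which is the behavior you want for a production deploy role.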
Here is a quick look at how you’d actually implement a "token exchange" instead of using a hardcoded secret:
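```python
import boto3

def get_ephemeral_creds(oidc_token):
    sts = boto3.client('sts')
    # Exchange the workload's OIDC identity token for a 15-minute session.
    # DurationSeconds=900 is the STS minimum; without it the default is 1 hour.
    response = sts.assume_role_with_web_identity(
        RoleArn='arn:aws:iam::123456789012:role/MyScopedWorkloadRole',
        WebIdentityToken=oidc_token,
        RoleSessionName='EphemeralWorkloadSession',
        DurationSeconds=900,
    )
    # Contains AccessKeyId, SecretAccessKey, SessionToken and Expiration
    return response['Credentials']
```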
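The exchange only works if the role trusts the identity provider, and that trust policy is where most of the scoping happens. A sketch, assuming the token comes from GitHub Actions and that an IAM OIDC provider for `token.actions.githubusercontent.com` already exists in the account; the account ID and repository are placeholders. The `sub` condition pins the role to one repository and branch, so a token minted for any other repo is refused.

```python
import json


def github_trust_policy(account_id: str, repo: str, branch: str = "main") -> dict:
    """Trust policy letting only one repo/branch assume the role via OIDC."""
    provider = "token.actions.githubusercontent.com"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Federated": f"arn:aws:iam::{account_id}:oidc-provider/{provider}"},
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        f"{provider}:aud": "sts.amazonaws.com",
                        f"{provider}:sub": f"repo:{repo}:ref:refs/heads/{branch}",
                    }
                },
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder account ID and repository
    print(json.dumps(github_trust_policy("123456789012", "example-org/payments-service"), indent=2))
```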
By moving toward these standards, we're finally treating machine identities with the same respect we give to our human users. It’s about building an architecture that expects things to disappear.
Next, we’re going to look at specific technical solutions like SPIFFE and how to handle identity sprawl across different industries.
Technical Solutions for Machine Identity Sprawl
Cleaning up machine identities is usually the last thing on a dev's mind when they're pushing code at 2 a.m., but honestly, it’s where the real security happens. If we don't automate the death of these identities, we're just building a digital graveyard that's still haunted by active permissions.
The gold standard for solving this right now is SPIFFE (Secure Production Identity Framework for Everyone). Instead of a pod just "inheriting" a role because of where it sits, it has to prove who it is through a process called attestation.
In environments where you can't just trust a network segment, the SPIRE agent inspects the workload, using details such as kernel-reported process information, the container image hash, and the namespace, before handing out a short-lived SVID (SPIFFE Verifiable Identity Document). Think of an SVID as a temporary digital ID card, usually an X.509 certificate or a JWT, that proves exactly which workload is talking.
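Conceptually, the agent's decision is a set comparison: a registration entry lists the selectors a workload must have, and the agent hands out that entry's SPIFFE ID only if the selectors it discovered about the workload include all of them. This is a toy model of that matching rule, not SPIRE's code or API; the selector strings are modeled on SPIRE's Kubernetes workload attestor (`k8s:ns:<namespace>`, `k8s:sa:<service account>`) and the entries are hypothetical. A real agent can also return several matching identities, whereas this sketch returns the first.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RegistrationEntry:
    spiffe_id: str
    selectors: frozenset[str]


# Hypothetical entries an operator registered ahead of time
ENTRIES = [
    RegistrationEntry("spiffe://example.org/ns/payments/sa/trader",
                      frozenset({"k8s:ns:payments", "k8s:sa:trader"})),
    RegistrationEntry("spiffe://example.org/ns/logging/sa/shipper",
                      frozenset({"k8s:ns:logging", "k8s:sa:shipper"})),
]


def identity_for(discovered: set[str]) -> Optional[str]:
    """Return the SPIFFE ID whose selectors are ALL present on the workload."""
    for entry in ENTRIES:
        if entry.selectors <= discovered:  # subset test
            return entry.spiffe_id
    return None  # no match: the workload gets no identity at all
```

Note the default: a pod that matches nothing gets no identity, rather than falling back to the node's.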
- Dynamic Minting: Identities are created on the fly and tied to the specific workload instance.
- Short TTLs: Most of these tokens should expire in minutes, not hours.
- Platform Agnostic: It works the same whether you're in AWS, on-prem, or some weird hybrid setup.
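On the receiving side, a service should check both who the peer is and how long its credential was issued for. A minimal sketch with the standard library, assuming the SPIFFE ID and validity window have already been extracted from a verified X.509-SVID (a real setup would use a SPIFFE library for that); the trust domain, allowed paths and the 10-minute TTL ceiling are hypothetical policy choices.

```python
from urllib.parse import urlsplit


def parse_spiffe_id(spiffe_id: str) -> tuple[str, str]:
    """Split spiffe://<trust-domain>/<path> into (trust_domain, path)."""
    parts = urlsplit(spiffe_id)
    if parts.scheme != "spiffe" or not parts.netloc:
        raise ValueError(f"not a SPIFFE ID: {spiffe_id!r}")
    if parts.query or parts.fragment or "@" in parts.netloc or ":" in parts.netloc:
        raise ValueError(f"SPIFFE IDs carry no query, fragment, user info or port: {spiffe_id!r}")
    return parts.netloc, parts.path


def authorize_peer(spiffe_id: str, not_before: float, not_after: float, now: float,
                   trust_domain: str = "example.org",
                   allowed_paths: frozenset = frozenset({"/ns/payments/sa/trader"}),
                   max_ttl_seconds: float = 600) -> bool:
    td, path = parse_spiffe_id(spiffe_id)
    if td != trust_domain or path not in allowed_paths:
        return False  # wrong trust domain or a workload we never expect to hear from
    if not (not_before <= now < not_after):
        return False  # outside the credential's validity window
    return (not_after - not_before) <= max_ttl_seconds  # reject long-lived SVIDs
```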
You really want to get away from static secrets in your CI/CD pipelines. Instead of storing an API key in GitHub Actions, you use an OIDC provider to talk to your cloud. Here is a quick way you might handle a temporary exchange for a deployment script:
```python
import boto3

def get_cloud_access(github_token):
    # No long-lived AWS keys in the runner's secrets or environment variables:
    # the job's short-lived OIDC token is exchanged for temporary credentials.
    client = boto3.client('sts')
    auth = client.assume_role_with_web_identity(
        RoleArn='arn:aws:iam::123456789012:role/DeployRole',  # placeholder account ID
        WebIdentityToken=github_token,
        RoleSessionName='CI-CD-Runner',
    )
    return auth['Credentials']
```
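The `github_token` passed to `get_cloud_access` has to come from somewhere. Inside a GitHub Actions job granted `permissions: id-token: write`, the runner exposes `ACTIONS_ID_TOKEN_REQUEST_URL` and `ACTIONS_ID_TOKEN_REQUEST_TOKEN`, and a GET to that URL with an `audience` parameter returns JSON whose `value` field is the OIDC token. A standard-library sketch; the function names are hypothetical, and in practice the `actions/core` toolkit or the `aws-actions/configure-aws-credentials` action usually does this for you.

```python
import json
import os
import urllib.parse
import urllib.request
from typing import Mapping


def github_oidc_request(audience: str = "sts.amazonaws.com",
                        env: Mapping[str, str] = os.environ) -> urllib.request.Request:
    """Build the token request; both environment variables are set by the runner."""
    url = env["ACTIONS_ID_TOKEN_REQUEST_URL"]
    sep = "&" if urllib.parse.urlsplit(url).query else "?"  # URL usually has a query already
    return urllib.request.Request(
        url + sep + urllib.parse.urlencode({"audience": audience}),
        headers={"Authorization": f"bearer {env['ACTIONS_ID_TOKEN_REQUEST_TOKEN']}"},
    )


def fetch_github_oidc_token(audience: str = "sts.amazonaws.com") -> str:
    """Only works inside a GitHub Actions job with id-token: write."""
    with urllib.request.urlopen(github_oidc_request(audience), timeout=10) as resp:
        return json.load(resp)["value"]


# Usage inside the job:
# creds = get_cloud_access(fetch_github_oidc_token())
```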
Industry Use Cases and the Blast Radius
When we talk about "Blast Radius," we're measuring how much damage one compromised identity can do. If a single token gives access to everything, your blast radius is the whole company. We want to shrink that down to almost nothing.
- Finance: A high-frequency trading microservice shouldn't inherit broad network admin rights. With SVIDs, the service can only talk to the specific database it needs, and because the SVID is short-lived, it stops working shortly after the trade is processed.
- Retail: During a Black Friday surge, thousands of auto-scaling pods are created. Instead of each one inheriting a legacy API key, they use OIDC to get 10-minute credentials. If one pod is compromised, the attacker only has a tiny window of time and very limited access.
- Healthcare: Data processing jobs shouldn't have permanent access to PII. You need a reaper process: something that monitors your orchestration layer and revokes the associated NHI (non-human identity) the moment the workload terminates.
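The reaper idea from the healthcare example reduces to a reconciliation loop: compare the identities you issued with the workloads the orchestrator still reports, and revoke the difference. A minimal in-memory sketch; `IdentityRegistry` and `reap` are hypothetical, and `revoke` is a stand-in for whatever your platform offers, such as deleting a Kubernetes service account or, on AWS, attaching a deny policy keyed on `aws:TokenIssueTime`, since individual STS sessions cannot be revoked one at a time.

```python
from dataclasses import dataclass, field


@dataclass
class IdentityRegistry:
    """Hypothetical record of which credential was issued to which workload."""
    issued: dict[str, str] = field(default_factory=dict)  # workload id -> credential id
    revoked: list[str] = field(default_factory=list)

    def revoke(self, credential_id: str) -> None:
        # Stand-in: a real reaper would call the identity provider or cloud API here.
        self.revoked.append(credential_id)


def reap(registry: IdentityRegistry, running: set[str]) -> list[str]:
    """Revoke credentials of workloads the orchestrator no longer reports as running."""
    reaped = []
    for workload_id in list(registry.issued):  # copy the keys: we mutate the dict
        if workload_id not in running:
            registry.revoke(registry.issued.pop(workload_id))
            reaped.append(workload_id)
    return reaped
```

Running it on every orchestrator event, or on a short timer, keeps the ghost window close to the polling interval.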
At the end of the day, managing machine identity sprawl is about discipline. We have the tools to move away from messy inheritance and toward a model where every machine identity is verified, temporary, and strictly governed. That’s how you actually close the loop on ephemeral risk.