How should security teams investigate malware that targets cloud workloads?

Why This Matters for Security Teams

Malware aimed at cloud workloads rarely behaves like endpoint malware. It often arrives through build pipelines, container images, exposed secrets, or compromised workload identities, then uses the cloud control plane, metadata services, and legitimate tooling to persist or spread. That means investigation has to answer two questions at once: what code ran, and what identity or trust path allowed it to operate. Without that, teams miss the real blast radius.

For practitioners, the most important shift is treating workload identity and runtime telemetry as first-class evidence. Guidance from the SPIFFE workload identity specification and NHIMG’s Guide to SPIFFE and SPIRE both point to the same operational reality: cryptographic identity, not just network location, is what tells defenders whether a workload should have been able to act. In cloud incidents, that distinction matters because a container, function, or ephemeral job may leave very little disk evidence but plenty of API and control-plane traces.

NHIMG’s research shows how often these investigations are undermined before they begin: only 1.5 out of 10 organisations (15%) are highly confident in securing NHIs, and 85% lack full visibility into third-party vendors connected via OAuth apps. In practice, many security teams discover the malware’s true foothold only after secrets, tokens, or workload credentials have already been reused elsewhere, rather than through intentional runtime monitoring.

How It Works in Practice

A workable investigation starts with preserving volatile workload artefacts before they disappear. That includes container stdout and stderr, orchestration events, process lineage, network flow logs, cloud audit logs, and any alert context tied to the workload identity. From there, investigators reconstruct three timelines: execution, persistence, and outbound behaviour. Execution shows what spawned the malware. Persistence shows whether it wrote to startup scripts or hooks, sidecars, cron jobs, or image layers. Outbound behaviour shows which services, buckets, registries, or token endpoints it reached.
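The three timelines can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the event type names, the `TIMELINE_BY_TYPE` mapping, and the record fields are all hypothetical, and real telemetry would first need to be normalised from the orchestrator, runtime sensor, and audit log formats in use.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical mapping from normalised event types to the three timelines.
TIMELINE_BY_TYPE = {
    "process_start": "execution",
    "container_exec": "execution",
    "cron_write": "persistence",
    "startup_hook_write": "persistence",
    "image_layer_change": "persistence",
    "dns_query": "outbound",
    "http_request": "outbound",
    "token_request": "outbound",
}


def build_timelines(events):
    """Group events into execution, persistence, and outbound timelines,
    each sorted by timestamp. Unknown types land in 'unclassified'."""
    timelines = defaultdict(list)
    for event in events:
        bucket = TIMELINE_BY_TYPE.get(event["type"], "unclassified")
        timelines[bucket].append(event)
    for bucket in timelines:
        timelines[bucket].sort(key=lambda e: datetime.fromisoformat(e["time"]))
    return dict(timelines)


# Illustrative events only; none of this is real incident data.
events = [
    {"time": "2024-05-01T10:02:00+00:00", "type": "http_request", "detail": "PUT to object storage"},
    {"time": "2024-05-01T10:00:30+00:00", "type": "container_exec", "detail": "shell started in container"},
    {"time": "2024-05-01T10:01:00+00:00", "type": "cron_write", "detail": "new crontab entry"},
    {"time": "2024-05-01T10:00:05+00:00", "type": "process_start", "detail": "sh spawned by web server"},
]

timelines = build_timelines(events)
for name, entries in timelines.items():
    print(name, [e["detail"] for e in entries])
```

Keeping an explicit `unclassified` bucket is a deliberate choice here: events that do not fit the mapping are surfaced for review rather than silently dropped.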

Strong cloud investigations also correlate identity and authorization data. If a pod, function, or VM used an assumed role, the investigation should trace which token was issued, which policy allowed it, and whether the resulting API calls match the workload’s normal purpose. This is where workload identity becomes essential: SPIFFE/SPIRE-style identity, OIDC workload tokens, and cloud-native attestation create a chain of evidence that ties activity back to a specific workload instance instead of a vague subnet or host.
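As a rough sketch of that correlation, the Python below parses a SPIFFE ID into its trust domain and path, then flags audit records whose action falls outside a per-workload baseline. The `BASELINE` table, the record fields, and the example identities are assumptions made for illustration; in practice the baseline would come from policy definitions or observed history.

```python
from urllib.parse import urlparse


def parse_spiffe_id(spiffe_id):
    """Split a SPIFFE ID (spiffe://<trust-domain>/<path>) into its parts."""
    parsed = urlparse(spiffe_id)
    if parsed.scheme != "spiffe" or not parsed.netloc:
        raise ValueError(f"not a SPIFFE ID: {spiffe_id!r}")
    return parsed.netloc, parsed.path


# Hypothetical baseline: the API actions each workload normally performs.
BASELINE = {
    "spiffe://example.org/ns/payments/sa/api": {"s3:GetObject", "kms:Decrypt"},
}


def unexpected_calls(audit_records, baseline):
    """Return audit records whose action is outside the workload's baseline.
    Workloads with no baseline entry have every call flagged."""
    findings = []
    for record in audit_records:
        allowed = baseline.get(record["workload_id"], set())
        if record["action"] not in allowed:
            findings.append(record)
    return findings


# Illustrative audit records, not taken from any real log.
audit_records = [
    {"workload_id": "spiffe://example.org/ns/payments/sa/api", "action": "s3:GetObject"},
    {"workload_id": "spiffe://example.org/ns/payments/sa/api", "action": "iam:CreateAccessKey"},
]

findings = unexpected_calls(audit_records, BASELINE)
for record in findings:
    trust_domain, path = parse_spiffe_id(record["workload_id"])
    print(f"{trust_domain}{path} performed unexpected action {record['action']}")
```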

Capture logs from the orchestrator, cloud audit plane, and runtime sensor before redeploying or scaling the workload.

Map outbound calls to secrets access, object storage, identity APIs, and control-plane actions.

Check for credential use after the initial compromise window, especially short-lived tokens that may have been replayed.

Rebuild the attack path from first execution to lateral movement, not just the final alert.
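The credential check in the steps above could be approximated as follows. This is a sketch under stated assumptions: the `issued` and `uses` structures, the field names, and the idea of a single `contained_at` timestamp are hypothetical simplifications of what token issuance and audit logs actually provide.

```python
from datetime import datetime, timezone


def flag_suspicious_token_use(issued, uses, contained_at):
    """Flag token uses that come from a source other than the workload the
    token was issued to, or that occur after the containment time."""
    findings = []
    for use in uses:
        token = issued.get(use["token_id"])
        if token is None:
            findings.append((use["token_id"], "unknown token"))
            continue
        if use["source"] != token["issued_to"]:
            findings.append((use["token_id"], "used from unexpected source"))
        if use["time"] > contained_at:
            findings.append((use["token_id"], "used after containment"))
    return findings


utc = timezone.utc

# Illustrative data: one token issued to a pod, later replayed from elsewhere.
issued = {
    "tok-1": {"issued_to": "pod-a", "issued_at": datetime(2024, 5, 1, 9, 55, tzinfo=utc)},
}
uses = [
    {"token_id": "tok-1", "source": "pod-a", "time": datetime(2024, 5, 1, 10, 0, tzinfo=utc)},
    {"token_id": "tok-1", "source": "203.0.113.7", "time": datetime(2024, 5, 1, 12, 30, tzinfo=utc)},
]
contained_at = datetime(2024, 5, 1, 12, 0, tzinfo=utc)

findings = flag_suspicious_token_use(issued, uses, contained_at)
for token_id, reason in findings:
    print(token_id, reason)
```

A use can raise more than one finding, which is intentional: a token replayed from a new source after containment is stronger evidence than either signal alone.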

NHIMG’s analysis of machine identity risk notes that 57% of organisations lack a complete inventory of their machine identities and 53% have experienced an incident tied to machine identity failures. That is why malware response in cloud environments increasingly depends on identity telemetry, not only forensics. These investigative steps tend to break down when workloads are highly ephemeral and logs are fragmented across multiple accounts, clusters, and regions, because the evidence disappears faster than responders can collect it.

Common Variations and Edge Cases

Tighter cloud runtime monitoring often increases operational overhead, requiring organisations to balance investigative depth against performance, cost, and alert fatigue. Current guidance suggests that the right level of detail depends on workload criticality, because not every service needs the same retention or sensor density.

Serverless functions, managed Kubernetes, and short-lived CI/CD runners create the hardest edge cases. In those environments, the malware may never touch a persistent disk, which makes memory capture less useful than event correlation and identity tracing. Best practice is evolving toward context-rich telemetry that combines orchestration events, IAM or workload token issuance, and outbound request logs. The Shai Hulud npm malware campaign and the 230M AWS environment compromise are useful reminders here: cloud malware often turns stolen credentials into durable access faster than defenders can invalidate them.

There is no universal standard for how much telemetry is enough, but teams should prioritise workloads that can reach secrets stores, object storage, CI systems, and identity services. Investigations also get harder when third-party integrations are involved, because a compromised OAuth app or external service account can make malicious traffic look legitimate.
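One simple way to express that prioritisation is to rank workloads by how many sensitive target classes they can reach. The sketch below assumes a hypothetical inventory in which each workload lists the target classes it can reach; the class names and the equal weighting are illustrative choices, not a recommended scoring model.

```python
# Hypothetical labels for the sensitive target classes named above.
SENSITIVE_TARGETS = {"secrets_store", "object_storage", "ci_system", "identity_service"}


def prioritise(workloads):
    """Rank workloads by the number of sensitive target classes they can
    reach, highest first; workloads that reach none are omitted."""
    scored = [(len(SENSITIVE_TARGETS & set(w["can_reach"])), w["name"]) for w in workloads]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in ranked if score > 0]


# Illustrative inventory only.
workloads = [
    {"name": "build-runner", "can_reach": ["ci_system", "secrets_store", "object_storage"]},
    {"name": "static-site", "can_reach": []},
    {"name": "auth-proxy", "can_reach": ["identity_service", "secrets_store"]},
]

priority_order = prioritise(workloads)
print(priority_order)
```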

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Non-Human Identity Top 10 and CSA MAESTRO address the attack and risk surface, while NIST AI RMF sets the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
OWASP Non-Human Identity Top 10	NHI-01	Cloud malware often abuses stolen NHIs and secrets to persist or move laterally.
CSA MAESTRO		MAESTRO covers runtime governance for autonomous cloud workloads and agentic execution paths.
NIST AI RMF		AI RMF supports governance of dynamic, context-rich cloud workload behaviour and response.

Use AI RMF to assign ownership, capture evidence, and evaluate risk when workload behaviour changes unexpectedly.
