12,000 API Keys Exposed in LLM Training Data Leak!

NHI Mgmt Group

(@nhi-mgmt-group)

Topic starter 18/12/2025 9:37 am

Executive Summary

In March 2025, Truffle Security reported finding nearly 12,000 valid API keys, passwords, and other credentials in a publicly accessible dataset from Common Crawl. This dataset, widely used to train large language models and generative AI systems, contained live secrets, including keys for major services such as Amazon Web Services (AWS) and MailChimp. The credentials were found during a routine scan of approximately 400 terabytes of archived web data covering more than 2.67 billion web pages. The exposure raises two concerns: anyone who reads the archive can misuse the credentials, and AI systems trained on the data may learn from pages that embed secrets. Both point to the need for stronger safeguards around sensitive information in public web data.

Read the full breach analysis from NHI Mgmt Group here

Key Details

Breach Timeline

December 2024: Common Crawl archives the dataset in question, which contains over 2.67 billion web pages.
Early March 2025: Truffle Security discovers nearly 12,000 exposed API keys and credentials while scanning that archive.
The exposure was identified through a routine scan of archived web data.

Data Compromised

Nearly 12,000 valid API keys and credentials were exposed, including keys for AWS and MailChimp.
Credentials for various cloud and web services were also found, heightening the risk of unauthorized access.
The breach impacts numerous developers and organizations relying on this data for AI training.

Impact Assessment

The exposure poses severe risks, including data leaks and unauthorized access to the services the credentials unlock.
AI systems trained on this data may inadvertently replicate insecure coding practices, such as hardcoding credentials in source code.
Increased scrutiny of AI training datasets and their security protocols is expected following this incident.

Company Response

Common Crawl is reviewing and assessing its data protection measures to prevent future breaches.
Researchers are urging AI developers to be cautious when utilizing open datasets for training models.
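
One way to act on that caution is to redact credential-like strings before documents enter a training corpus. The following is a minimal sketch under stated assumptions, not the method used by Truffle Security or Common Crawl: the regular expression covers only two commonly documented key shapes (AWS access key IDs and Mailchimp-style API keys), and the names `CREDENTIAL_RE`, `PLACEHOLDER`, `redact_document`, and `clean_corpus` are hypothetical.

```python
import re

# Assumption: two illustrative key shapes only. A production filter would
# need far more rules than this.
CREDENTIAL_RE = re.compile(
    r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"      # AWS-style access key ID
    r"|\b[0-9a-f]{32}-us[0-9]{1,2}\b"     # Mailchimp-style API key
)
PLACEHOLDER = "<REDACTED_SECRET>"  # hypothetical replacement token


def redact_document(text):
    """Return (cleaned_text, replacement_count) for one document."""
    return CREDENTIAL_RE.subn(PLACEHOLDER, text)


def clean_corpus(documents):
    """Yield each document with credential-like strings replaced."""
    for doc in documents:
        cleaned, _count = redact_document(doc)
        yield cleaned


if __name__ == "__main__":
    sample = 'aws_key = "AKIAIOSFODNN7EXAMPLE"'
    print(redact_document(sample))
```

Replacing the match with a placeholder, instead of dropping the whole page, keeps the surrounding text available for training while removing the secret itself.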

Security Implications

This breach underscores the necessity of robust cybersecurity protocols to protect sensitive API keys and credentials.
Organizations must implement better monitoring systems to detect unauthorized exposures in public datasets.
AI developers are encouraged to adopt secure coding practices to mitigate risks associated with compromised training data.
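
As a rough illustration of the monitoring point above, the sketch below scans a page of text for strings that look like credentials. It is a simplified example, not the tooling used in the reported scan: the two patterns are assumptions based on commonly documented key formats, `SECRET_PATTERNS` and `find_candidate_secrets` are hypothetical names, and a pattern match only marks a candidate. Confirming that a key is valid requires a separate check against the issuing service, which is not shown here.

```python
import re

# Illustrative rules only; real scanners carry many more patterns.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
    "mailchimp_api_key": re.compile(r"\b[0-9a-f]{32}-us[0-9]{1,2}\b"),
}


def find_candidate_secrets(text):
    """Return (rule_name, matched_string) pairs for credential-like strings."""
    hits = []
    for name, pattern in SECRET_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((name, match.group(0)))
    return hits


if __name__ == "__main__":
    page = (
        'var key = "AKIAIOSFODNN7EXAMPLE"; '
        'mc = "0123456789abcdef0123456789abcdef-us12";'
    )
    for rule, value in find_candidate_secrets(page):
        # Print only a prefix so the report does not re-expose the secret.
        print(rule, value[:8] + "...")
```

Organizations can run a check of this kind against their own public pages and repositories, and rotate any key it surfaces, since a key that has appeared in a public archive should be treated as compromised.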

If you want to learn more about how to secure NHIs including AI Agents, check our NHI Foundational Training Course.

This topic was modified 7 months ago by NHI Mgmt Group
