Nearly 12,000 Secrets Found in Public LLM Training Dataset

In early March 2025, security researchers at Truffle Security discovered that the publicly available Common Crawl dataset, widely used to train large language models and other generative AI systems, contained nearly 12,000 valid API keys, passwords, and other credentials.

These exposed secrets include keys for major services such as Amazon Web Services (AWS), MailChimp, and other cloud or web services.

Because Common Crawl’s archive is openly accessible and frequently used by AI developers worldwide, the findings raise serious concerns, not only about credential exposure but also about the possibility that AI models trained on this data have learned insecure coding practices or could even reproduce sensitive secrets in their output.

What Happened

Researchers from Truffle Security scanned roughly 400 terabytes of web‑archive data from Common Crawl’s December 2024 snapshot, covering data from over 2.67 billion web pages.

From this scan, they identified 11,908 secrets that still worked, meaning they were live and could authenticate successfully.

Many of these secrets resulted from poor developer practices: credentials were hard‑coded into front‑end HTML or JavaScript, or otherwise embedded directly within web content instead of being stored securely on servers or in proper secret management systems.
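As a rough illustration of how hard-coded credentials can be spotted in page source, here is a minimal Python sketch. It is not Truffle Security's method: the two regular expressions are simplified assumptions based on commonly cited key formats, the function name `find_secret_candidates` is hypothetical, and the sample key is made up. Unlike the researchers' scan, this sketch only finds candidates and does not verify whether a key is live.

```python
import re

# Illustrative patterns only (assumptions, not an exhaustive detector set).
SECRET_PATTERNS = {
    # AWS access key IDs commonly start with "AKIA" followed by 16 characters.
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # Mailchimp API keys are commonly 32 hex characters plus a data-center suffix.
    "mailchimp_api_key": re.compile(r"\b[0-9a-f]{32}-us[0-9]{1,2}\b"),
}


def find_secret_candidates(text):
    """Return (secret_type, matched_string) pairs found in text."""
    hits = []
    for name, pattern in SECRET_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((name, match.group(0)))
    return hits


# Hypothetical page content containing a made-up, non-functional key.
page = '<script>var key = "0123456789abcdef0123456789abcdef-us12";</script>'
print(find_secret_candidates(page))
```

A match is only a candidate: pattern-based detection produces false positives and misses formats it has no pattern for, which is why verification against the issuing service matters.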

Some specific findings:

Around 1,500 unique MailChimp API keys were found embedded directly in pages’ HTML or JavaScript.
One API key tied to another service (WalkScore) appeared a staggering 57,029 times across 1,871 different subdomains, illustrating how deeply some credentials were exposed and reused.
The dataset included a wide variety of secrets: in total, researchers cataloged 219 distinct secret types.
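To make the difference between total occurrences and distinct hosts concrete, the following Python sketch tallies both for each secret. The input shape (pairs of secret and page URL), the function name `summarize_reuse`, and the sample data are assumptions for illustration and are not taken from the researchers' tooling.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def summarize_reuse(findings):
    """Summarize reuse from an iterable of (secret, page_url) pairs.

    Returns {secret: (total_occurrences, distinct_host_count)}.
    """
    counts = defaultdict(int)
    hosts = defaultdict(set)
    for secret, url in findings:
        counts[secret] += 1
        host = urlsplit(url).hostname
        if host:
            hosts[secret].add(host)
    return {secret: (counts[secret], len(hosts[secret])) for secret in counts}


# Hypothetical findings using placeholder key names and example domains.
findings = [
    ("KEY_A", "https://a.example.com/index.html"),
    ("KEY_A", "https://a.example.com/contact.html"),
    ("KEY_A", "https://b.example.com/"),
    ("KEY_B", "https://c.example.org/"),
]
print(summarize_reuse(findings))
```

A key with a high occurrence count spread over many hosts, like the WalkScore example, is much harder to clean up than one that appears on a single page.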

Following the discovery, Truffle Security reached out to affected vendors and assisted in revoking and rotating many of the compromised keys.

Why This Matters

This disclosure has serious implications for both AI development and security hygiene more broadly:

Real‑world credential exposure: Live API keys and passwords exposed inadvertently can be exploited, leading to unauthorized access, data theft, resource misuse, or financial fraud.
Risk to AI model integrity and trust: Because many AI models are trained in part using Common Crawl data, there is a risk that sensitive secrets may have been ingested during training. Even if filtering is applied, there’s no guarantee that all secrets are removed, potentially leading to flawed or risky model behavior.
Widespread scale and reuse: The volume and duplication of exposed secrets (some reused across hundreds or thousands of pages or subdomains) magnify the potential damage and make cleanup harder.
Supply‑chain shockwave: Applications, websites, and services that embed or reuse hard-coded credentials contribute to a vulnerable ecosystem, where bad practices in one place can affect many downstream systems.

What Organizations & Developers Should Do

Given the seriousness of these findings, here are immediate actions and best practices to adopt:

Audit public code and web assets: Search through website HTML, JS, and front‑end assets for embedded credentials; treat any found secret as potentially compromised and rotate immediately.
Use secure secret management: Never hard‑code secrets in front‑end code. Use environment variables, secret vaults, or server‑side storage.
Scan data sources used for AI training: If using web‑scraped datasets (or third‑party data) as training inputs, run secret‑scanning tools to filter out API keys, credentials, webhooks and other sensitive information.
Rotate and revoke exposed credentials: Upon discovery of leaks, especially public ones, revoke and replace credentials immediately.
Treat non-human identities and tokens with care: As with human credentials, service accounts, tokens and keys deserve lifecycle management, least‑privilege access, and regular audits.
Educate developers and teams about credential hygiene: Make secret hygiene a standard part of coding guidelines, especially for front‑end or publicly visible code.
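The secret management advice above can be sketched in a few lines of Python. This is a minimal example under stated assumptions: the code runs on the server, the secret is supplied through an environment variable, and the names `get_required_secret`, `MissingSecretError`, and `MAILCHIMP_API_KEY` are hypothetical. A dedicated secret vault would replace the environment lookup in a more complete setup.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a required secret is not configured."""


def get_required_secret(name, environ=os.environ):
    """Read a secret from the environment instead of hard-coding it.

    Fails loudly if the variable is missing or blank, so a misconfigured
    deployment does not silently run without credentials.
    """
    value = environ.get(name, "").strip()
    if not value:
        raise MissingSecretError(f"Environment variable {name} is not set")
    return value


# Example with an explicit mapping so the sketch runs without real configuration.
demo_env = {"MAILCHIMP_API_KEY": "dummy-value-for-illustration"}
api_key = get_required_secret("MAILCHIMP_API_KEY", environ=demo_env)
print(len(api_key))
```

The key point is that the value never appears in HTML or JavaScript sent to the browser; the front end calls a server endpoint, and only the server holds the credential.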

How NHI Mgmt Group Can Help

Incidents like this underscore a critical truth: Non-Human Identities (NHIs) are now at the center of modern cyber risk. OAuth tokens, AWS credentials, service accounts, and AI-driven integrations act as trusted entities inside your environment, yet they’re often the weakest link when it comes to visibility and control.

At NHI Mgmt Group, we specialize in helping organizations understand, secure, and govern their non-human identities across cloud, SaaS, and hybrid environments. Our advisory services are grounded in a risk-based methodology that drives measurable improvements in security, operational alignment, and long-term program sustainability.

We also offer the NHI Foundation Level Training Course, the world’s first structured course dedicated to Non-Human Identity Security. This course gives you the knowledge to detect, prevent, and mitigate NHI risks.

If your organization uses third-party integrations, AI agents, or machine credentials, this training isn’t optional; it’s essential.

Final Thoughts

The discovery of nearly 12,000 valid API keys and passwords in an AI training dataset should be a wake‑up call for both the AI community and developers worldwide. It shows how a combination of careless coding practices and unfiltered public data can lead to a massive exposure of sensitive credentials, with potential for widespread abuse.

As organizations continue to build, train, and deploy AI models, and as developers ship more code to the web, secret safety and credential hygiene must become non‑negotiable. Overlooking these basics isn’t just negligence; it invites attackers to walk right in.