The Internet Archive Breach

The Internet Archive, famous for its Wayback Machine and massive digital archives, fell victim to a major data breach affecting 31 million user accounts. The cause was unsecured authentication tokens in its GitLab repository, which had been left accessible for almost two years. Threat actors exploited these tokens to gain access to critical systems, databases, and user data. The breach also involved the theft of sensitive support tickets, some of which contained personal identification details. Furthermore, the attackers used a compromised Zendesk API token to send phishing emails and impersonate the organization.

Incident Timeline

December 2022

The Internet Archive’s GitLab repository was accidentally configured to expose an authentication token. This vulnerability remained undetected for 22 months, representing a significant operational security gap.

October 9, 2024

By exploiting the exposed GitLab token, the attacker gained access to The Internet Archive’s source code, which contained sensitive data. The attacker then escalated privileges and exfiltrated 7TB of data, including the full user database and other internal records.

October 2024

At the same time, a Distributed Denial of Service (DDoS) attack hit The Internet Archive. It was conducted by a different threat actor and had nothing to do with the data breach, but its timing caused confusion about its purpose.

October 20, 2024

The attacker used Zendesk credentials and secrets found in the exfiltrated data to send phishing emails to users. These emails appeared legitimate and exploited users’ trust to trick them into revealing more personal information.

Late October 2024

The stolen data was shared on underground forums, exposing millions of users to a high risk of phishing and fraud.

Incident Analysis

Root Cause

An unsecured and improperly managed GitLab repository was the root cause of the breach. An authentication token needed to access internal systems was embedded in a configuration file inside the repository, leaving it open to exploitation.
Because the token stayed exposed for an extended period without being revoked or rotated, the attacker was able to gain unauthorized access to the system.
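One common way to avoid embedding a token in a repository file is to read it from the process environment at startup, with the value supplied by a secrets manager or deployment system. The following is a minimal sketch of that idea, not a description of The Internet Archive’s setup; the variable name ARCHIVE_GITLAB_TOKEN and the helper names are hypothetical.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a required secret is not present in the environment."""


def load_token(name="ARCHIVE_GITLAB_TOKEN", environ=None):
    """Return the secret stored under `name`, or fail loudly if it is absent.

    `name` is a hypothetical variable name used only for illustration.
    `environ` can be any mapping, which keeps the function easy to test.
    """
    env = os.environ if environ is None else environ
    value = env.get(name, "").strip()
    if not value:
        # Failing at startup is safer than falling back to a hardcoded default.
        raise MissingSecretError(f"secret {name!r} is not set")
    return value
```

Because the token never appears in a tracked file, exposing the repository does not by itself expose the credential.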

Attack Technique

Initial Access

An exposed authentication token found in a misconfigured GitLab repository was the starting point of the attack. This token gave the attacker access to The Internet Archive’s internal environment.

Access Vector – the token embedded in the configuration file acted as a hardcoded credential. It had no expiration or rotation policy, which enabled the attacker to authenticate as a legitimate service or user.
Lateral Movement – once inside, the attacker searched the repository and extracted sensitive credentials embedded in the source code files. These included access keys to critical infrastructure components such as databases, internal APIs, and potentially cloud storage systems. With them, the attacker was able to escalate privileges and reach additional internal systems.
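The missing expiration and rotation policy can be illustrated with a simple age check. This is only a sketch: the 90-day limit is an assumed policy rather than anything stated in the incident reports, and the issue date uses an assumed day because the timeline gives only "December 2022".

```python
from datetime import datetime, timedelta, timezone

# Assumed policy for illustration; the source does not state any intended lifetime.
MAX_TOKEN_AGE = timedelta(days=90)


def needs_rotation(issued_at, now, max_age=MAX_TOKEN_AGE):
    """Return True when a credential has reached or exceeded its allowed age."""
    return now - issued_at >= max_age


# Hypothetical dates: the day in December 2022 is assumed.
issued = datetime(2022, 12, 1, tzinfo=timezone.utc)
checked = datetime(2024, 10, 9, tzinfo=timezone.utc)
print(needs_rotation(issued, checked))  # True
```

A check like this, run on a schedule against an inventory of issued tokens, would flag a credential that had been live for 22 months long before it could be abused.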

Data Exfiltration

The attacker exfiltrated nearly 7TB of data, including sensitive user information, hashed passwords, and internal operational files.
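A transfer of this size may stand out against normal outbound traffic, provided someone is measuring it. The sketch below flags days whose outbound volume far exceeds the median of the observed period. The function name, the factor of 5, and the sample figures are all assumptions for illustration; nothing here reflects actual Internet Archive traffic data.

```python
from statistics import median


def flag_egress_spikes(daily_bytes, factor=5.0):
    """Return the indexes of days whose outbound volume exceeds factor x median.

    `daily_bytes` is a list of outbound byte counts, one entry per day.
    The median is used as the baseline so a single spike does not hide itself.
    """
    if not daily_bytes:
        return []
    baseline = median(daily_bytes)
    if baseline <= 0:
        return []
    return [i for i, sent in enumerate(daily_bytes) if sent > factor * baseline]


# Made-up volumes in gigabytes; the fifth day is the outlier.
print(flag_egress_spikes([10, 12, 11, 9, 700, 10]))  # [4]
```

A real deployment would track baselines per host or per service and account for legitimate bulk transfers, which matter a great deal for an archive that serves large files by design.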

Zendesk Exploitation

Credential Abuse – the attacker used API keys extracted from the compromised repository to authenticate to the Zendesk platform. These keys allowed the attacker to access and control legitimate Zendesk features.
Phishing Campaign – using access to the Zendesk platform, which handles The Internet Archive’s support ticket communication, the attacker sent emails that appeared to come from The Internet Archive’s legitimate support email system. Because they were sent through the authorized Zendesk platform, these emails passed email authentication checks such as DKIM, DMARC, and SPF instead of being flagged by them.
Target Exploitation – the attacker used these phishing emails to trick recipients into handing over more information, such as account credentials or personal details.
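Mail sent through an authorized platform will generally pass SPF, DKIM, and DMARC, so a passing result says little about who was actually operating the account. The sketch below reads the verdicts recorded in a message’s Authentication-Results header using Python’s standard email module. The sample message and the example.org addresses are invented for illustration and are not taken from the actual phishing emails.

```python
import re
from email import message_from_string

# Matches "spf=pass", "dkim=fail", "dmarc=none" and similar method=verdict pairs.
RESULT_RE = re.compile(r"\b(spf|dkim|dmarc)=(\w+)", re.IGNORECASE)


def auth_results(raw_message):
    """Return a dict such as {"spf": "pass", ...} from Authentication-Results."""
    msg = message_from_string(raw_message)
    results = {}
    for header in msg.get_all("Authentication-Results", []):
        for method, verdict in RESULT_RE.findall(header):
            results[method.lower()] = verdict.lower()
    return results


# Invented sample: a message that passes all three checks.
SAMPLE = (
    "Authentication-Results: mx.example.org;\n"
    " spf=pass smtp.mailfrom=support.example.org;\n"
    " dkim=pass header.d=example.org;\n"
    " dmarc=pass header.from=example.org\n"
    "From: Support <support@example.org>\n"
    "Subject: Your ticket\n"
    "\n"
    "Please confirm your account details.\n"
)

print(auth_results(SAMPLE))
```

All three checks pass for the sample, yet the body is still a request for account details. These checks validate the sending infrastructure, not the intent behind a message, which is why compromised support tooling is so useful to a phisher.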

Motivations Behind the Attack

The attacker’s primary motivation appears to have been to gain credibility and build a strong reputation in underground forums and within the cybersecurity community.

Impact of the Breach

On Users – Personal Data Exposure

User Database: more than 31 million user records were leaked, exposing usernames, email addresses, and hashed passwords. This could lead to credential stuffing attacks.
Support Tickets: personal details in more than 800,000 support tickets were also leaked. These tickets included sensitive inquiries, uploaded identification documents, and potentially private user communications. This information could be used in fraudulent activities, including identity theft, opening financial accounts, or filing tax returns in the victims’ names.

On The Internet Archive

Reputational Damage

The Internet Archive is a widely trusted nonprofit organization, known for preserving digital history, and it may experience a serious loss of trust from users. This could discourage future support from donors and users, especially if people conclude that the organization cannot protect their information.

Financial Costs

Costs incurred from hiring cybersecurity experts, implementing enhanced security measures, and conducting security audits.
Potential fines from data protection authorities and class-action lawsuits from affected users.

Recommendations

Use secure vaulting solutions to manage and rotate secrets.
Conduct regular security audits to identify any potential vulnerabilities.
Deploy tools to detect leaked credentials in real time.
Implement advanced anomaly detection systems to identify any suspicious behaviour.
Enforce immediate password resets for all affected users.
Develop a comprehensive incident response plan.
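As an illustration of the credential detection recommended above, the sketch below scans text for strings that look like secrets. It is deliberately minimal and is no substitute for a dedicated scanner. The two patterns are illustrative assumptions: `glpat-` is GitLab’s default prefix for personal access tokens, but the length bound and the generic assignment rule are guesses chosen for the example. The reports do not say what format the exposed token had.

```python
import re
from pathlib import Path

# Illustrative rules only; dedicated scanners ship far larger, tuned rule sets.
PATTERNS = {
    "gitlab_pat": re.compile(r"glpat-[0-9A-Za-z_\-]{20,}"),
    "generic_assignment": re.compile(
        r"(?i)(?:token|secret|api_key|password)\w*\s*[:=]\s*['\"]([^'\"]{12,})['\"]"
    ),
}


def scan_text(text):
    """Return (line_number, rule_name) pairs for lines that look like secrets."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for rule, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, rule))
    return findings


def scan_tree(root):
    """Yield (path, line_number, rule_name) for every readable file under root."""
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue  # Skip unreadable files instead of aborting the scan.
        for lineno, rule in scan_text(text):
            yield str(path), lineno, rule
```

Running `scan_tree(".")` from a repository checkout, for example as a pre-commit or CI step, lists candidate findings for a human to review. Pattern matching of this kind produces false positives and misses unfamiliar formats, so it complements vaulting and rotation without replacing them.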