Wayback Copilot exposes private GitHub repos through cached data

By NHI Mgmt Group Editorial Team | Published 2026-03-16 | Domain: Best Practices | Source: Lasso Security

TL;DR: Microsoft Copilot could return data from GitHub repositories that had been public only briefly and later made private, because Bing cached those pages and exposed so-called zombie data, according to Lasso Security. The finding shows that repository privacy, secret hygiene, and retrieval permissions are still leaky at the identity layer, not just the app layer.

At a glance

What this is: Lasso Security shows that Microsoft Copilot could surface cached content from GitHub repositories that were once public and later made private, creating a zombie-data exposure path.

Why it matters: IAM and security teams need to treat brief public exposure as persistent identity and data risk, because caching can outlive repository privacy changes across NHI, autonomous, and human workflows.

By the numbers:

300+ private tokens, keys & secrets to GitHub, Hugging Face, GCP, OpenAI, etc. were exposed.

👉 Read Lasso Security's analysis of Wayback Copilot and private GitHub exposure

Context

Private repositories are only private if every discovery and retrieval layer respects the same access boundary. In this case, GitHub privacy changed, but indexed cache copies and Copilot retrieval still preserved access to content that users expected to be gone.

For identity and access teams, the problem is not just repository visibility. It is the wider trust chain around indexing, caching, delegated retrieval, and secret exposure, where a brief public state can become a durable access artifact. That makes GitHub hygiene, secret handling, and search-layer controls part of the same governance problem.

AI agent identity risk applies here only in the broad sense of machine-mediated retrieval, not autonomous behaviour. This is an NHI and access-governance issue first, because the exposure involved code, tokens, and internal packages rather than a human authentication failure.

Key questions

Q: What breaks when a repository is made private after it was briefly public?

A: What breaks is the assumption that privacy changes erase prior discoverability. Search engines, caches, and AI retrieval layers can retain historical copies, so content may remain reachable even after GitHub shows a 404. Teams need to treat public-to-private transitions as potential exposure events until cached copies and secrets are reviewed, and they should use the 52 NHI breaches Report to understand how often hidden exposure paths matter.

Q: Why do private repositories still create secret exposure risk?

A: Private repositories still create risk because source files often contain reusable credentials, tokens, build artefacts, and internal package references. If a repository was public even briefly, those details may already be indexed or cached elsewhere. The governance problem is not only access control on GitHub, but whether any downstream system can still surface what was exposed.

Q: How do security teams know if cached code exposure is still active?

A: Teams should test the repository name and known file paths in major search engines and AI assistants, then compare results against current repository status. If deleted or private content still returns, the exposure remains active in a machine-accessible form. That signal matters because it shows discoverability has outlived the intended access window.
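The comparison described in this answer can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Lasso Security's method: the origin status code and the list of cached hits are assumed to be collected by hand or with whatever search tooling a team already uses, and the names `CachedHit` and `classify_exposure` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CachedHit:
    """One result found in a search engine or AI assistant (hypothetical record)."""
    source: str     # e.g. "bing-cache" or "copilot"; labels are illustrative
    file_path: str  # path inside the repository that the result exposed


def classify_exposure(origin_status: int, hits: List[CachedHit]) -> str:
    """Compare the repository's current HTTP status with downstream findings.

    origin_status is the status code seen when requesting the repository page
    without credentials (200 = public, 404 = private or deleted).
    """
    if origin_status == 200:
        return "public"                  # still public at the origin
    if hits:
        return "cached-exposure-active"  # origin says no, a downstream layer says yes
    return "no-cached-copy-found"        # nothing surfaced in the layers that were tested


hits = [CachedHit(source="bing-cache", file_path="config/settings.py")]
print(classify_exposure(404, hits))
```

A "no-cached-copy-found" result only covers the search engines and assistants that were actually tested, so it should not be read as proof that no copy exists elsewhere.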

Q: Who is accountable when an AI assistant surfaces private code from a cached repository?

A: Accountability usually spans the repository owner, the platform owner, and the team governing indexing or retrieval. The source system may be private, but the cached copy may still be live in another layer. That is why privacy incidents involving code and secrets should be handled as cross-platform access governance failures, not isolated GitHub hygiene issues.

Technical breakdown

How cached repository data survives privacy changes

Search engines and retrieval systems can preserve a copy of content after the source repository becomes private. That means access control changes at the origin do not automatically erase indexed snapshots, cached pages, or derived search artefacts. In practice, a repository can move from public to private while the retrieval layer continues to answer from previously captured content. The result is a split-brain access model: the source of truth says no, but a downstream index still says yes. This is why privacy state and discoverability state must be treated as separate control surfaces.

Practical implication: Track cached exposure as a separate control problem, not a GitHub permission issue alone.

Why LLM retrieval can surface zombie data

LLM copilots often sit on top of search and retrieval layers, not directly on the live source system. If those layers have cached content, the model can generate answers from stale snapshots even after the original data has been withdrawn. This is not the same as the model memorising secret data during training. It is a retrieval boundary failure, where the assistant is too effective at finding historical artefacts the user thought were deleted or private. That creates a durable exposure path for code, secrets, and internal package metadata.

Practical implication: Validate what your AI assistants can retrieve from historical search caches before exposing internal repos.
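The retrieval boundary failure can be shown with a toy model. The sketch below is not how Bing or Copilot are implemented; it only assumes a cache keyed by repository name and an origin that records current visibility, and every name and value in it is hypothetical.

```python
from typing import Dict, Optional

# Origin: current visibility per repository (hypothetical repository name).
origin_visibility: Dict[str, str] = {"acme/billing": "private"}

# Snapshot captured while the repository was still public (placeholder content).
search_cache: Dict[str, str] = {
    "acme/billing": "API_TOKEN = 'example-not-a-real-token'"
}


def retrieve_naive(repo: str) -> Optional[str]:
    """Answer from the cache without consulting the origin."""
    return search_cache.get(repo)


def retrieve_with_recheck(repo: str) -> Optional[str]:
    """Serve a snapshot only if the origin still reports the repository as public."""
    if origin_visibility.get(repo) != "public":
        return None
    return search_cache.get(repo)


print(retrieve_naive("acme/billing"))         # the stale snapshot is still returned
print(retrieve_with_recheck("acme/billing"))  # the snapshot is withheld
```

The difference between the two functions is the whole point: the naive path treats the cache as authoritative, while the second path treats the origin's current permission as the gate.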

Why repository exposure creates secret and dependency risk

Once private code is indexed, the exposure is not limited to source files. Repository content often includes API keys, tokens, build instructions, package names, and dependency references that can support credential abuse or dependency confusion. The security problem expands from one repository to a broader identity and supply chain issue, because secrets and internal packages become actionable even if the repository is later secured. That is why exposure analysis must include scanning for credentials, package leakage, and any internal artefact that can be reused outside the repository boundary.

Practical implication: Treat a briefly public repository as a potential secret-leak and supply-chain event until the full blast radius is mapped.
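As a rough illustration of the credential scanning mentioned above, the following sketch checks text against a handful of regular expressions. The three patterns are illustrative assumptions only; maintained secret scanners use much larger rule sets plus validation, and the sample strings are synthetic placeholders rather than real credentials.

```python
import re
from typing import Dict, List, Tuple

# Illustrative patterns only; real scanners ship far larger, maintained rule sets.
SECRET_PATTERNS: Dict[str, re.Pattern] = {
    "github_pat_classic": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "generic_assignment": re.compile(
        r"(?i)\b(?:api_key|token|secret)\b\s*[:=]\s*['\"][^'\"]{8,}['\"]"
    ),
}


def scan_text(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, rule_name) for every line that matches a pattern."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        for rule, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((number, rule))
    return findings


# Synthetic sample built from repeated characters, not real secrets.
sample = "\n".join([
    "region = 'eu-west-1'",
    "aws_key = 'AKIA" + "A" * 16 + "'",
    "token = 'ghp_" + "a" * 36 + "'",
])
print(scan_text(sample))
```

Any credential that such a scan finds in a repository that was ever public should be treated as exposed and rotated, whether or not a cached copy can still be located.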

Threat narrative

Attacker objective: The objective is to recover code, secrets, and internal package details from repositories that users believe are private or deleted.

Entry occurred when repository content was public long enough to be indexed by Bing and later retrieved through cached search artefacts.
Credential access or abuse occurred when Copilot surfaced historical repository content, including material that could contain tokens, keys, and internal package references.
Impact followed as thousands of repositories, many organisations, and hundreds of secrets became reachable through a machine-mediated retrieval path even after the original content was no longer public.

Emerald Whale breach: exposed Git config files led to 15K secrets stolen and 10K repo compromises.
McKinsey AI platform breach: the hack exposed 46M chats and sensitive data.

Read our 52 NHI Breaches Analysis report for a comprehensive view of breaches impacting Non-Human Identities including AI Agents.

NHI Mgmt Group analysis

Zombie data is a governance failure, not a search quirk: once content has been public, organisations often assume privacy changes end the exposure story. This case shows that search caches and copilots can preserve the old access state long after the source repository is locked down. The practitioner conclusion is simple: visibility is not the same as revocation, and the data lifecycle must extend beyond the originating platform.

Repository privacy alone does not govern machine retrieval: the control that failed here was the assumption that a private repository is no longer discoverable by downstream systems. Bing caching and Copilot retrieval created a parallel access path that identity teams do not usually model in access reviews. The implication is that discovery, indexing, and retrieval permissions must be treated as part of the same entitlement surface.

Secret exposure in code turns a privacy lapse into identity compromise: the article’s findings show hundreds of tokens, keys, and internal package artefacts exposed through a single retrieval mechanism. That is not just data leakage. It is a reusable identity event, because secrets are credentials, not content. Practitioners should treat public-code incidents as credential-governance failures with supply-chain consequences.

Hidden retrieval paths create an identity blast radius beyond the repository owner: the same caching logic affected thousands of repositories across many organisations, which means the blast radius is defined by indexing behaviour, not only by GitHub tenancy. This strengthens the case for cross-domain governance across code hosting, search, and AI assistants. The practical conclusion is that ownership of the repository is not enough to define accountability.

Wayback Copilot names a specific failure mode, cached-access persistence: the repository was made private, but machine access persisted through cached and indexed copies. The assumption that a privacy change ends access was designed for a world where permission changes immediately collapsed discoverability. The implication is that access governance now has to account for stale machine-visible state, not just current human-visible state.

From our research:
300+ private tokens, keys & secrets to GitHub, Hugging Face, GCP, OpenAI, etc. were exposed, according to Lasso Security.
According to The State of Secrets Sprawl 2025, 4.6% of all public GitHub repositories contain at least one hardcoded secret, a signal that code exposure and credential exposure are tightly linked.
For a broader breach lens, 52 NHI breaches Report shows how exposed identities and secrets routinely become reusable attack paths rather than isolated leaks.

What this signals

Zombie-data exposure changes the governance model for code hosting. Security teams should assume that once content has been public, it may persist in search caches and AI retrieval layers beyond the repository’s current state. That makes publication control, not just current repository state, part of the control objective.

With 4.6% of all public GitHub repositories containing at least one hardcoded secret, per The State of Secrets Sprawl 2025, the real risk is not a single leaked key but the reuse potential of exposed code across search, AI, and supply-chain paths.

Cached-access persistence: this is the failure mode teams should now name internally when a private repository remains reachable through a machine retrieval layer. Once that pattern is understood, incident response can focus on search removal, secret rotation, and package-name exposure instead of only repo permission changes.

For practitioners

Audit for cached repository exposure: Inventory repositories that were ever public, even briefly, and test whether cached copies still surface through search or copilots. Include archived pages, mirrored content, and indirect retrieval paths in the review.
Scan exposed code for reusable secrets: Run secret and token detection across any repository that changed from public to private, then rotate credentials that were present in indexed files, build configs, or package metadata.
Separate repository privacy from retrieval governance: Treat indexing, caching, and AI retrieval as distinct access planes with their own controls and owners. A private repository should not be assumed to be undiscoverable until those planes are checked.
Review internal package exposure for dependency confusion: Check whether private package names, import paths, or build references were exposed in cached code and could be reused for dependency confusion or package impersonation.
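The dependency confusion review in the last item can be sketched as a comparison of three sets of names. This is a simplified example under assumptions: the cached manifest is in requirements.txt style, the list of internal packages comes from the organisation's own inventory, and the set of publicly registered names is assumed to have been gathered separately through registry lookups. All package names and the index URL are hypothetical.

```python
import re
from typing import Iterable, List, Set


def package_names(requirements_text: str) -> List[str]:
    """Extract package names from requirements.txt-style lines."""
    names = []
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.append(match.group(0).lower())
    return names


def claimable_names(exposed: Iterable[str], internal: Set[str], public: Set[str]) -> List[str]:
    """Internal names that leaked and are not yet registered publicly."""
    return sorted(n for n in set(exposed) if n in internal and n not in public)


# Hypothetical manifest recovered from a cached copy of a now-private repository.
cached_requirements = """
requests==2.31.0
acme-billing-core>=1.2   # internal
acme-auth
--index-url https://pypi.internal.example/simple
"""
internal = {"acme-billing-core", "acme-auth"}
public = {"requests", "acme-auth"}  # assumed result of separate registry lookups

print(claimable_names(package_names(cached_requirements), internal, public))
```

Names returned by `claimable_names` are ones an outsider could still register on a public index. An internal name that already exists publicly, such as `acme-auth` in this sample, deserves its own review, because it may indicate that someone has registered it already.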

Key takeaways

A repository that is public even briefly can remain discoverable through caches and AI retrieval layers after it is made private.
The scale is material, with thousands of repositories and hundreds of secrets exposed in the research path described here.
The relevant control is broader than GitHub permissions alone because cached discovery, secrets, and package metadata all extend the blast radius.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

The OWASP Non-Human Identity Top 10 addresses the attack and risk surface, while NIST CSF 2.0 and NIST Zero Trust (SP 800-207) set the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
OWASP Non-Human Identity Top 10	NHI2:2025 (Secret Leakage)	Cached repo exposure and secret leakage map to credential persistence and rotation gaps.
NIST CSF 2.0	PR.AA-05	Downstream retrieval access extends beyond the source repo's access control.
NIST Zero Trust (SP 800-207)	Section 2.1 (Tenets of Zero Trust)	Zero Trust requires continuous verification across search and retrieval layers, not just GitHub.

Review public-to-private repos for exposed secrets and rotate any credential that may have been indexed.

Key terms

Zombie Data: Data that users believe is private, deleted, or no longer reachable, but that still persists in caches, indexes, or downstream retrieval systems. In identity terms, the exposure outlives the source system’s access change, so governance must account for residual machine-visible copies.
Cached-access Persistence: The condition where a search engine, cache, or AI assistant can still retrieve content after the original source has been restricted. It is an access governance problem because the effective audience no longer matches the source system’s current permissions.
Repository Exposure Window: The period when a repository is public, even if only briefly, and therefore eligible for indexing, caching, or secondary retrieval. The window matters because short-lived exposure can create long-lived downstream risk once copies are captured elsewhere.
Credential Reuse Risk: The likelihood that secrets, tokens, or keys embedded in exposed code can be used again in other systems. In NHI governance, this turns a file disclosure into an identity event because the credential, not the file, is the reusable asset.
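The repository exposure window defined above can be computed from a log of visibility changes. The sketch below assumes such a log is available as (timestamp, visibility) pairs; the event format, the function name, and the dates are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Event = Tuple[datetime, str]  # (timestamp, new visibility: "public" or "private")


def exposure_windows(events: List[Event], now: datetime) -> List[Tuple[datetime, datetime]]:
    """Return (start, end) pairs for every period the repository was public."""
    windows = []
    opened = None
    for when, visibility in sorted(events):
        if visibility == "public" and opened is None:
            opened = when                   # a window opens
        elif visibility == "private" and opened is not None:
            windows.append((opened, when))  # the window closes
            opened = None
    if opened is not None:                  # still public at review time
        windows.append((opened, now))
    return windows


# Hypothetical timeline: the repository was public for 45 minutes.
events = [
    (datetime(2024, 8, 1, 9, 0), "private"),
    (datetime(2024, 8, 2, 14, 0), "public"),
    (datetime(2024, 8, 2, 14, 45), "private"),
]
windows = exposure_windows(events, now=datetime(2024, 9, 1))
total = sum((end - start for start, end in windows), timedelta())
print(windows, total)
```

Even a short total like the one in this sample should trigger the cached-exposure and secret checks, because the length of the window says nothing about whether a crawler visited during it.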

Deepen your knowledge

AI-mediated retrieval of private code and cached repository exposure are core topics in our NHI Foundation Level course, the industry's only accredited NHI security programme. If you are dealing with GitHub exposure, secret rotation, or AI search risk, it is worth exploring.

This post draws on content published by Lasso Security: Wayback Copilot: Using Microsoft’s Copilot to Expose Thousands of Private GitHub Repositories. Read the original.

NHIMG Editorial Note
Published by the NHIMG editorial team on 2026-03-16.
NHI Mgmt Group — the independent authority on Non-Human Identity, IAM, and Agentic AI security. nhimg.org