Wayback Copilot and private GitHub repos: what IAM teams need to know


(@nhi-mgmt-group)

TL;DR: Microsoft Copilot could return data from GitHub repositories that had been public only briefly and later made private, because Bing cached those pages and exposed so-called zombie data, according to Lasso Security. The finding shows that repository privacy, secret hygiene, and retrieval permissions are still leaky at the identity layer, not just the app layer.

NHIMG editorial — based on content published by Lasso Security: Wayback Copilot: Using Microsoft’s Copilot to Expose Thousands of Private GitHub Repositories

Questions worth separating out

Q: What breaks when a repository is made private after it was briefly public?

A: What breaks is the assumption that privacy changes erase prior discoverability.

Q: Why do private repositories still create secret exposure risk?

A: Private repositories still create risk because source files often contain reusable credentials, tokens, build artefacts, and internal package references.

Q: How do security teams know if cached code exposure is still active?

A: Teams should test the repository name and known file paths in major search engines and AI assistants, then compare results against current repository status.
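The comparison step in that answer can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Hit` record, the `repo_of` helper, and the `stale_hits` function are hypothetical names, the hits are assumed to be collected by hand from search results or assistant answers, and the list of currently private repositories is assumed to come from your own inventory. It does not call any search engine or Copilot API.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional
from urllib.parse import urlparse


@dataclass(frozen=True)
class Hit:
    """One result observed manually in a search engine or AI assistant (hypothetical record)."""
    source: str  # free-form label for where the result was seen, e.g. "bing"
    url: str     # the GitHub URL the result pointed at or cited


def repo_of(url: str) -> Optional[str]:
    """Return 'owner/repo' in lowercase for a github.com URL, otherwise None."""
    parsed = urlparse(url)
    if parsed.netloc.lower() not in {"github.com", "www.github.com"}:
        return None
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) < 2:
        return None
    return f"{parts[0]}/{parts[1]}".lower()


def stale_hits(hits: Iterable[Hit], private_repos: Iterable[str]) -> List[Hit]:
    """Keep only hits that reference a repository currently marked private."""
    private = {name.lower() for name in private_repos}
    return [hit for hit in hits if repo_of(hit.url) in private]
```

Any hit this returns points at a repository that should no longer be discoverable, which makes it a candidate for a cache-removal request and a secret review.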

Practitioner guidance

  • Audit for cached repository exposure: Inventory repositories that were ever public, even briefly, and test whether cached copies still surface through search or copilots.
  • Scan exposed code for reusable secrets: Run secret and token detection across any repository that changed from public to private, then rotate credentials that were present in indexed files, build configs, or package metadata.
  • Separate repository privacy from retrieval governance: Treat indexing, caching, and AI retrieval as distinct access planes with their own controls and owners.
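For the secret-scanning step, a rough sketch of what pattern-based detection looks like is below. The three patterns are a small illustrative set chosen for this example (a classic GitHub personal access token prefix, an AWS access key ID prefix, and a PEM private key header), not a complete ruleset, and `scan_text` is a hypothetical helper name. A dedicated secret scanner will cover far more formats and should be preferred in practice.

```python
import re
from typing import List, Tuple

# Illustrative patterns only; real scanners ship far larger, maintained rulesets.
PATTERNS = {
    "github_pat_classic": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    "aws_access_key_id": re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
    "private_key_block": re.compile(r"-----BEGIN (?:[A-Z]+ )*PRIVATE KEY-----"),
}


def scan_text(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, pattern_name) for every line that matches a pattern."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, name))
    return findings
```

The function reports line numbers and pattern names rather than the matched values, so the findings list can be shared without copying the secrets themselves. Anything it flags in a repository that was ever public should be treated as compromised and rotated.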

What's in the full report

Lasso Security's full research covers the operational detail this post intentionally leaves for the source:

  • Exact BigQuery and Bing workflow used to identify zombie repositories and cached pages
  • Examples of public-to-private repositories where Copilot still returned historical content
  • Manual validation steps for confirming whether cached pages still expose code or secrets
  • Microsoft's response timeline and the partial-remediation problem after cached-link removal

👉 Read Lasso Security's analysis of Wayback Copilot and private GitHub exposure →




   
(@mr-nhi)
 

Zombie data is a governance failure, not a search quirk: once content has been public, organisations often assume privacy changes end the exposure story. This case shows that search caches and copilots can preserve the old access state long after the source repository is locked down. The practitioner conclusion is simple: visibility is not the same as revocation, and the data lifecycle must extend beyond the originating platform.

A few things that frame the scale:

  • More than 300 private tokens, keys, and secrets for GitHub, Hugging Face, GCP, OpenAI, and other services were exposed, according to The State of Secrets Sprawl 2025.
  • In the same research, 4.6% of all public GitHub repositories contain at least one hardcoded secret, a signal that code exposure and credential exposure are tightly linked.

A question worth separating out:

Q: Who is accountable when an AI assistant surfaces private code from a cached repository?

A: Accountability usually spans the repository owner, the platform owner, and the team governing indexing or retrieval. The source system may be private, but the cached copy may still be live in another layer. That is why privacy incidents involving code and secrets should be handled as cross-platform access governance failures, not isolated GitHub hygiene issues.

👉 Read our full editorial: Wayback Copilot exposes private GitHub repos through cached data



   