
Who is accountable when an AI assistant surfaces private code from a cached repository?

By NHI Mgmt Group Editorial Team | Updated June 9, 2026 | Domain: Governance, Ownership & Risk

Accountability usually spans the repository owner, the platform owner, and the team governing indexing or retrieval. The source system may be private, but the cached copy may still be live in another layer. That is why privacy incidents involving code and secrets should be handled as cross-platform access governance failures, not isolated GitHub hygiene issues.

Why This Matters for Security Teams

When an AI assistant resurfaces private code from a cached repository, the issue is rarely a single misconfiguration. It usually reflects a control gap across source control, indexing, retrieval, and the assistant layer itself. That means ownership is shared, but accountability still needs a primary decision-maker. The practical question is not whether the code was once private, but whether cached copies were governed as sensitive data across every system that could replay them.

This is why security teams should treat the event as an access-governance failure, not just a content leak. The same pattern shows up in incidents like the DeepSeek breach and the GitLocker GitHub extortion campaign, where exposure can move from one layer to another faster than human review can keep up. The NIST Cybersecurity Framework 2.0 is useful here because it frames governance, asset visibility, and response as connected responsibilities rather than isolated controls.

In practice, many security teams discover the blast radius only after an assistant has already answered with material that was never meant to be retrievable in the first place.

How It Works in Practice

Accountability usually follows the control plane that made the exposure possible. The repository owner is responsible for classifying and protecting the source material. The platform owner is responsible for caching, indexing, retention, and deletion behavior. The team operating retrieval or agent tooling is responsible for how search, embeddings, and prompt-time access checks are enforced. If any one of those layers keeps stale data, the AI assistant can surface it even after the original repository was fixed.
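The split of responsibilities described above can be sketched as a small lookup. This is only an illustration: the `Layer` names, the `OWNER_BY_LAYER` mapping, and the `accountable_owners` helper are hypothetical and would need to match how a given organisation actually assigns ownership.

```python
from enum import Enum


class Layer(Enum):
    SOURCE = "source repository"
    CACHE = "cache or index"
    RETRIEVAL = "retrieval or agent tooling"


# Hypothetical mapping that mirrors the ownership split described in the text.
OWNER_BY_LAYER = {
    Layer.SOURCE: "repository owner",
    Layer.CACHE: "platform owner",
    Layer.RETRIEVAL: "retrieval team",
}


def accountable_owners(stale_layers):
    """Return the owner of every layer that still holds stale data, in layer order."""
    return [OWNER_BY_LAYER[layer] for layer in Layer if layer in stale_layers]
```

For example, `accountable_owners({Layer.CACHE})` points at the platform owner even when the source repository has already been fixed.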

Operationally, this means teams need to ask four questions:

  • Was the repository copied into a cache, vector store, or search index with its own retention rules?
  • Could the assistant retrieve data based on similarity or prior embeddings without checking current authorization?
  • Was the private code protected by the same access policy in the downstream system as in the source system?
  • Was deletion propagated everywhere, or only removed from the original repository?
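One way to make the four questions repeatable is to record the answers for each downstream copy and turn them into findings. The sketch below assumes a hypothetical `DownstreamCopy` record filled in by hand or by an inventory process; the field names and finding strings are illustrative, not part of any standard.

```python
from dataclasses import dataclass


@dataclass
class DownstreamCopy:
    name: str
    own_retention_rules: bool    # Q1: copy lives under its own retention rules
    checks_current_authz: bool   # Q2: retrieval checks current authorization
    policy_matches_source: bool  # Q3: same access policy as the source system
    deletion_propagated: bool    # Q4: deletion reached this copy


def audit(copy):
    """Return a list of governance gaps for one downstream copy."""
    findings = []
    if copy.own_retention_rules:
        findings.append("independent retention rules")
    if not copy.checks_current_authz:
        findings.append("no current-authorization check at retrieval")
    if not copy.policy_matches_source:
        findings.append("access policy differs from source")
    if not copy.deletion_propagated:
        findings.append("deletion not propagated")
    return findings
```

An empty result means the copy passed all four questions as recorded; it does not prove the answers themselves were accurate.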

This is where current guidance suggests using NIST CSF-style governance alongside explicit data-lineage tracking. NHIMG research on the state of secrets in AppSec shows why this matters: leaked secrets remain hard to remediate, and AI systems can reproduce sensitive patterns from codebases. That makes the retrieval layer a real security boundary, not just a convenience feature.

Where this guidance breaks down is in distributed developer platforms that sync code into multiple caches, because deletion and access revocation often do not propagate uniformly across those replicas.
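A simple way to detect uneven propagation is to check every replica for a repository that should already be gone. The sketch below assumes a hypothetical inventory in which each replica reports the set of repository IDs it currently stores; real platforms would need their own way of producing that inventory.

```python
def replicas_still_holding(repo_id, replicas):
    """Return the names of replicas that still store the given repository.

    `replicas` maps a replica name to the set of repository IDs it holds.
    """
    return sorted(name for name, repo_ids in replicas.items() if repo_id in repo_ids)
```

Any non-empty result after a deletion or access revocation indicates a replica where the change did not land.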

Common Variations and Edge Cases

Tighter retrieval controls often increase latency and engineering overhead, so organisations have to balance user experience against the risk of resurfacing private code. There is no universal standard for this yet, especially when assistants span code search, ticketing, and chat histories. That means some teams will adopt strict deny-by-default retrieval, while others will allow broader recall for productivity and rely on post-hoc monitoring.
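The two postures can be contrasted in a few lines. This is a sketch under stated assumptions: the `Retriever` class, its `allowlist` of approved sources, and the chunk dictionaries with a `source` key are all hypothetical, and the audit log stands in for whatever post-hoc monitoring a team actually runs.

```python
from dataclasses import dataclass, field


@dataclass
class Retriever:
    allowlist: set                 # sources explicitly approved for retrieval
    deny_by_default: bool = True   # False means broad recall with monitoring
    audit_log: list = field(default_factory=list)

    def filter(self, user, chunks):
        """Return the chunks this policy lets through, logging unapproved sources."""
        returned = []
        for chunk in chunks:
            approved = chunk["source"] in self.allowlist
            if approved or not self.deny_by_default:
                returned.append(chunk)
            if not approved:
                outcome = "blocked" if self.deny_by_default else "returned"
                self.audit_log.append((user, chunk["source"], outcome))
        return returned
```

In deny-by-default mode the unapproved chunk never reaches the assistant; in broad-recall mode it does, and the log entry is the only control left.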

The main edge cases are usually operational, not theoretical. A repository may be private, but its cached fragments can remain in:

  • search indexes that were not rebuilt after access changes
  • embedding stores that outlive the source repository
  • support transcripts or agent memory layers that retain snippets
  • third-party indexing services with their own retention and deletion timelines

Best practice is evolving toward context-aware authorization at retrieval time, plus explicit lifecycle rules for cached code and secrets. That approach helps separate who owns the source from who is accountable for downstream exposure. The Emerald Whale breach is a reminder that once sensitive material is replicated into another system, the exposure pattern can change faster than the original control owner expects.
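A minimal sketch of authorization at retrieval time combined with a lifecycle rule might look like the following. The 30-day `MAX_CACHE_AGE`, the `current_acl` mapping of repository to allowed readers, and the chunk fields `repo` and `cached_at` are assumptions made for illustration only.

```python
from datetime import datetime, timedelta, timezone

MAX_CACHE_AGE = timedelta(days=30)  # hypothetical lifecycle rule for cached code


def authorized_chunks(user, chunks, current_acl, now=None):
    """Keep only chunks the user may read right now and that are not past their lifetime.

    `current_acl` maps a repository name to the set of users allowed to read it
    at query time, not at the time the chunk was cached.
    """
    now = now or datetime.now(timezone.utc)
    allowed = []
    for chunk in chunks:
        readers = current_acl.get(chunk["repo"], set())
        fresh = now - chunk["cached_at"] <= MAX_CACHE_AGE
        if user in readers and fresh:
            allowed.append(chunk)
    return allowed
```

A repository missing from `current_acl` yields no readers, so cached fragments of a deleted or newly restricted repository are dropped rather than replayed.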

For security leaders, the practical answer is that accountability should be assigned to the platform owner for the cache, with shared remediation obligations across the repository and retrieval teams.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Non-Human Identity Top 10 and CSA MAESTRO address the attack and risk surface, while NIST AI RMF sets the governance and control requirements practitioners need to meet.

Framework | Control / Reference | Relevance
OWASP Non-Human Identity Top 10 | NHI-05 | Cached code exposure often stems from overbroad non-human retrieval paths.
CSA MAESTRO | GOV-02 | Shared accountability across repo, cache, and agent layers needs explicit governance.
NIST AI RMF | | AI RMF governance addresses accountability for harmful model outputs and data exposure.

Restrict NHI access paths to least privilege and verify every retrieval source before returning code.

NHIMG Editorial Note
Reviewed and updated by the NHIMG editorial team on June 9, 2026.