By NHI Mgmt Group Editorial Team | Published 2026-05-07 | Domain: Breaches & Incidents | Source: Cyera

TL;DR: Cyera says an unauthenticated out-of-bounds heap read in Ollama, tracked as CVE-2026-7482 with CVSS 9.1, can expose prompts, system messages, environment variables, and other sensitive heap data through only three API calls on roughly 300,000 internet-facing instances. The deeper issue is that local AI runtimes can become high-value NHI exposure points when authentication, segmentation, and secret hygiene are missing.


At a glance

What this is: Cyera's research describes a critical unauthenticated memory leak in Ollama that can expose prompts, secrets, and other heap data with three API calls.

Why it matters: For IAM and NHI teams, this shows how a local AI inference engine can turn into an unmanaged identity and data exposure path when access controls are absent.


Context

Local AI platforms are increasingly being deployed as shared inference layers, but many of them inherit weak defaults around network exposure, authentication, and data handling. In NHI terms, that means the platform itself becomes a sensitive machine identity boundary, because prompts, environment variables, API keys, and model artifacts can all converge in one execution path.

Cyera's report centers on a memory-safety flaw in Ollama's model creation flow, where a crafted GGUF file can force reads beyond the intended buffer and capture heap-resident data. That matters because the impact goes beyond service disruption: the flaw discloses non-human identity secrets and user content from systems that often sit close to development and agentic AI workflows.

This starting point is unfortunately typical for fast-moving AI infrastructure: deployment convenience outpaces identity governance, and the asset is treated like a developer tool rather than a production data path.


Key questions

Q: How should security teams secure internet-facing local AI inference servers?

A: Security teams should require authentication in front of every inference endpoint, remove public exposure where possible, and segment AI workloads from general-purpose networks. Local AI servers often process prompts, secrets, and tool outputs in the same runtime, so access control, egress limits, and patching all need to be enforced together.

Q: Why do local AI platforms increase NHI secret exposure risk?

A: They increase risk because prompts, system instructions, API keys, and environment variables can coexist in the same process memory. If a vulnerability exposes heap data or an operator leaves the service unauthenticated, a single compromise can reveal multiple non-human identity secrets and user conversations at once.

Q: What breaks when AI runtimes are deployed without authentication?

A: Without authentication, the service becomes a reachable trust boundary rather than a controlled internal capability. Attackers can probe endpoints, submit crafted payloads, and use built-in export or push functions to move leaked data out of the environment. In practice, unauthenticated access turns one software bug into a much larger exposure problem.

Q: What should teams do in the first 24 to 72 hours after exposure is found?

A: Contain the endpoint, apply the patch or block external access, and rotate any secrets that may have been loaded into memory. Then review logs, artifact exports, and agent integrations to determine whether prompts, tokens, or proprietary code were exposed before the fix was applied.


Technical breakdown

How an out-of-bounds heap read leaks AI runtime data

The core issue is an out-of-bounds heap read, a memory disclosure in which the program reads past the end of an allocated buffer and returns whatever data sits next to it. In this case, the parser accepts a GGUF file that claims tensor dimensions larger than the actual payload, and the quantization pipeline then reads past the real buffer. Because heap memory contains recent process data, the attacker may capture prompts, system instructions, environment variables, and other in-process secrets without needing to break authentication. The danger is amplified in local AI runtimes because the same process may handle user content, model data, and credentials in one memory space.

Practical implication: Treat memory-safety flaws in AI runtimes as potential secret-exposure events, not only availability bugs.
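The class of check that prevents this kind of read can be sketched in a few lines. This is a simplified illustration, not Ollama's actual fix: the function name, the error type, and the fixed bytes-per-element model are assumptions (real GGUF quantization formats use block-based layouts).

```python
from math import prod


class TensorBoundsError(ValueError):
    """Raised when a tensor header claims more data than the payload holds."""


def checked_tensor_nbytes(dims, bytes_per_element, offset, payload_len):
    """Return the tensor's byte length, or raise if it would overrun the payload.

    Hypothetical helper: all names and the flat element-size model are
    assumptions made for illustration.
    """
    if bytes_per_element <= 0 or offset < 0:
        raise TensorBoundsError("invalid element size or offset")
    if not dims or any(d <= 0 for d in dims):
        raise TensorBoundsError("dimensions must be positive")

    # Python integers do not overflow; in a fixed-width language this
    # multiplication would itself need an overflow check.
    nbytes = prod(dims) * bytes_per_element

    # Compare the declared size against the bytes actually present
    # before any loop reads from the buffer.
    if offset + nbytes > payload_len:
        raise TensorBoundsError(
            f"tensor claims {nbytes} bytes at offset {offset}, "
            f"but the payload is only {payload_len} bytes"
        )
    return nbytes
```

The point of the sketch is ordering: the declared dimensions are checked against the real payload length before any processing loop trusts them.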

Why unauthenticated inference endpoints become high-risk NHI surfaces

A locally hosted LLM platform is not just a model server. It is a machine identity boundary that often receives sensitive inputs from developers, internal users, and agentic tools, then returns outputs that may be pushed, stored, or forwarded elsewhere. When such a service listens on all interfaces without authentication, every exposed instance becomes a reachable trust edge. In NHI governance terms, that breaks the assumption that non-human access is already constrained by network placement alone. The endpoint can now be used as a conduit for both data disclosure and credential harvesting.

Practical implication: Inventory AI runtimes like you would any internet-facing identity service and force authenticated access in front of every endpoint.
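As a starting point for that inventory, the bind address of each runtime can be classified automatically. The sketch below is a minimal example using Python's standard ipaddress module; the function name and the three labels are assumptions chosen for illustration, and it only inspects a configured host string rather than probing the network.

```python
import ipaddress


def bind_exposure(host):
    """Classify a configured bind host.

    Returns 'all-interfaces', 'loopback', 'specific', or 'unknown'.
    Hypothetical helper for inventory scripts.
    """
    host = host.strip()
    if host in ("", "*"):
        return "all-interfaces"
    if host.lower() == "localhost":
        return "loopback"
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # A hostname we cannot classify without resolving it.
        return "unknown"
    if addr.is_unspecified:  # 0.0.0.0 or ::
        return "all-interfaces"
    if addr.is_loopback:  # 127.0.0.0/8 or ::1
        return "loopback"
    return "specific"
```

Any instance classified as 'all-interfaces' deserves a follow-up check for whether an authenticating proxy or firewall rule actually sits in front of it.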

How push-based exfiltration turns disclosure into a complete compromise path

The report's three-step chain is important because it shows how an attacker can move from memory disclosure to durable exfiltration without complex post-exploitation tooling. First, the crafted file triggers the read. Second, model creation fills the resulting artifact with leaked heap data. Third, the built-in push function sends that artifact to an attacker-controlled server. This pattern matters for agentic AI environments because built-in transfer features can become data smuggling channels when the platform trusts its own outputs more than the content it processed.

Practical implication: Review every AI platform feature that exports, syncs, or publishes artifacts as a potential exfiltration path.
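One way to constrain such a path is an explicit allowlist of destinations, enforced at a proxy or wrapper outside the runtime. The sketch below assumes push targets are written as a URL or as a host-prefixed reference; the function name and the example hosts are hypothetical, and it does not describe any built-in Ollama control.

```python
from urllib.parse import urlsplit


def is_allowed_push_target(target, allowed_hosts):
    """Return True only if the target's host is on the allowlist.

    Hypothetical policy check: accepts 'https://host/path' or
    'host/path' style targets.
    """
    # Give scheme-less targets a network-location prefix so urlsplit
    # parses the leading component as a host rather than a path.
    parts = urlsplit(target if "://" in target else "//" + target)
    host = (parts.hostname or "").lower()
    return host in {h.lower() for h in allowed_hosts}
```

Matching on the parsed hostname, rather than on a string prefix, avoids being fooled by targets such as `https://registry.internal.example@evil.example/model`, where the real destination is the host after the `@`.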


Threat narrative

Attacker objective: The attacker aims to extract in-memory AI conversation data and host secrets from an internet-facing local inference server.

  1. Entry occurs through a crafted GGUF file submitted to an exposed Ollama instance with no authentication.
  2. Escalation happens when model creation forces an out-of-bounds heap read and copies sensitive in-memory data into the generated artifact.
  3. Impact follows when the built-in push feature sends the tainted model file to an attacker-controlled server, exposing prompts, secrets, and environment variables.

Read our 52 NHI Breaches Analysis report for a comprehensive view of breaches impacting Non-Human Identities including AI Agents.


NHI Mgmt Group analysis

Unauthenticated local AI runtimes create an identity problem before they create a memory problem. The headline vulnerability is a heap read, but the operational failure is broader: the platform is reachable without a trust gate. That means secrets, prompts, and agent outputs can be harvested from a service that never should have been exposed as a public endpoint. Practitioners should treat local inference systems as governed NHI infrastructure, not convenience tooling.

Memory disclosure in AI infrastructure creates identity blast radius, not just data spill. When environment variables and tokens sit in the same runtime as prompts and model artifacts, a single flaw can expose multiple classes of NHI secrets at once. That collapses the usual separation between application data, credentials, and execution context. The practical conclusion is to reduce what any one AI process can see, store, and export.
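Reducing what a process can see can start with its environment. A minimal sketch, assuming the runtime is launched from a wrapper script: only explicitly allowed variables are passed to the child process, so unrelated tokens in the parent environment never enter its memory. The helper names are hypothetical.

```python
import os
import subprocess


def scoped_env(allowed_names, source=None):
    """Build an environment containing only the allowed variable names."""
    source = os.environ if source is None else source
    return {name: source[name] for name in allowed_names if name in source}


def run_scoped(cmd, allowed_names):
    """Run cmd with a minimal environment instead of inheriting everything."""
    return subprocess.run(cmd, env=scoped_env(allowed_names), check=True)
```

For example, `run_scoped(["some-runtime", "serve"], ["PATH", "HOME"])` would start a placeholder command with only those two variables. Some platforms need a few additional system variables for processes to start at all, so the allowlist should be tested per environment.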

Push features and model export paths are the hidden exfiltration layer in agentic AI stacks. Many teams focus on prompt filtering while overlooking the built-in ability to package and move artifacts outward. In an AI runtime, that means a disclosure vulnerability can become a clean outbound channel with very little attacker effort. Security teams should assume that any export or sync function can be abused until it is explicitly constrained.

Identity governance for AI infrastructure now has to include memory-safety exposure. Conventional IAM does not detect or contain out-of-bounds reads, but it does define who can reach a service and what that service can reach in turn. That makes access proxies, segmentation, and secret scoping part of the same control plane as patching. The control objective is to reduce the blast radius before a memory bug becomes a credential incident.

Cyera's findings point to a wider pattern of unmanaged AI endpoints becoming shadow AI infrastructure. The issue is not limited to one product, because any unauthenticated inference service with artifact export and secret-rich workloads can produce similar exposure. That suggests a category-wide governance gap in discovery, authentication, and egress control. Teams should reclassify public AI runtimes as high-risk NHI assets and manage them accordingly.

What this signals

Identity blast radius: local AI runtimes can concentrate prompts, secrets, and export functions in one place, so the programme risk is no longer limited to model quality. With 85% of organisations lacking full visibility into third-party vendors connected via OAuth apps, according to The State of Non-Human Identity Security, the same visibility gap now applies to AI endpoints and the identities that feed them.

Teams should expect more incidents where a memory bug, an integration mistake, or a public endpoint exposes both data and credentials in one event. That means asset discovery, authentication enforcement, and egress monitoring need to be treated as a single control set rather than separate projects. The practical signal is simple: if you cannot name every exposed inference service, you do not yet control the NHI surface.

The programme implication is that AI infrastructure needs the same discipline as privileged workloads. Align endpoint access with Zero Trust Architecture, limit standing exposure, and review whether any service account or token can reach an inference layer without a clear business reason. A local AI server with open network reach is a governance failure before it is a vulnerability.


For practitioners

  • Patch and verify the fix immediately: Apply the vendor-released remediation, then verify that tensor element counts are validated against actual buffer sizes before any quantization loop runs. Confirm the patch is active in every environment that exposes Ollama or a similar local inference service.
  • Remove unauthenticated network exposure: Place an authentication proxy or API gateway in front of every AI inference endpoint and block public access to default ports such as 11434. If patching is delayed, restrict ingress at the firewall and confirm no instance is reachable from the public internet.
  • Rotate secrets from exposed hosts: Assume environment variables, API keys, and tokens may have been resident in memory if the service was internet-facing. Rotate credentials, review service account bindings, and invalidate any token that could have been embedded in heap data.
  • Audit agentic integrations and export paths: Review Claude Code, LangChain, and any other tooling that routes prompts or artifacts through the inference layer. Pay particular attention to model push, file export, and sync functions that can move leaked data outside the trust boundary.
  • Segment AI workloads and constrain egress: Isolate AI servers on dedicated network segments with strict outbound controls so a disclosure event cannot be turned into effortless exfiltration. Pair segmentation with logging that can show which data moved, when it moved, and which identity path was used.
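
The authentication step in the list above can be illustrated with the check an authenticating proxy might run before forwarding a request. This is a sketch under stated assumptions: a single shared bearer token, a plain dictionary of headers with a normalised "Authorization" key, and a hypothetical function name. A production gateway would normally rely on an established identity provider instead.

```python
import hmac


def is_authorized(headers, expected_token):
    """Return True if the request carries the expected bearer token.

    Hypothetical gate for a proxy in front of an inference endpoint.
    """
    value = headers.get("Authorization", "")
    scheme, _, presented = value.partition(" ")
    # Reject missing headers, wrong schemes, and empty tokens outright.
    if scheme.lower() != "bearer" or not presented or not expected_token:
        return False
    # Constant-time comparison avoids leaking token contents via timing.
    return hmac.compare_digest(presented.encode(), expected_token.encode())
```

Requests that fail the check should be rejected at the proxy so they never reach the runtime's API.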

Key takeaways

  • Local AI inference servers can expose prompts and secrets when memory safety and authentication controls fail together.
  • Cyera's report shows a three-call unauthenticated path that can turn heap disclosure into practical exfiltration.
  • Security teams should treat AI runtimes as NHI assets and enforce authentication, segmentation, and secret rotation by default.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Agentic AI Top 10 and OWASP Non-Human Identity Top 10 address the attack and risk surface, while NIST Zero Trust (SP 800-207) sets the governance and control requirements practitioners need to meet.

Framework | Control / Reference | Relevance
OWASP Agentic AI Top 10 | NHI-01 | Covers agent-facing exposure and prompt or tool-data leakage
OWASP Non-Human Identity Top 10 | NHI-03 | Credential rotation is directly relevant after in-memory secret exposure
NIST Zero Trust (SP 800-207) | PR.AC-4 | Zero Trust access and segmentation fit unauthenticated AI runtimes

Rotate any credential that could have been resident in heap memory and review token lifetime.


Key terms

  • Non-Human Identity: A non-human identity is a machine, workload, service account, token, certificate, or agent that authenticates and acts in an environment. In AI systems, these identities often carry broad access, long lifetimes, and weak visibility, which makes governance and secret handling central to security.
  • Unauthenticated inference endpoint: An unauthenticated inference endpoint is an AI service that accepts requests without proving the caller's identity. It is especially risky when exposed on public networks because anyone can submit payloads, trigger processing, and potentially reach data stored in memory, logs, or exported artifacts.
  • Heap memory disclosure: Heap memory disclosure occurs when a program reads data outside the intended buffer and returns whatever sits in process memory. In AI runtimes, that can expose prompts, system instructions, tokens, and other secrets that were never meant to leave the service.
  • Identity blast radius: Identity blast radius is the amount of access, data, and downstream systems exposed when a single non-human identity or runtime is compromised. It is a practical way to think about how far one leaked token, exposed endpoint, or memory bug can spread inside an enterprise.

Deepen your knowledge

AI runtime exposure and non-human identity secret handling are core topics in our NHI Foundation Level course, the industry's only accredited NHI security programme. If you are securing local inference platforms or agentic workflows, it is worth exploring.

This post draws on content published by Cyera: Bleeding Llama, a critical memory leak in Ollama. Read the original.

NHIMG Editorial Note
Published by the NHIMG editorial team on 2026-05-07.
NHI Mgmt Group — the independent authority on Non-Human Identity, IAM, and Agentic AI security. nhimg.org