
Ollama GPU Acceleration: Complete Guide for CUDA, ROCm & Metal

Watching the text crawl across my terminal line by line, I couldn’t help but check the time—47 seconds. That’s how long it took to run Llama 3 8B on my old laptop. CPU maxed out, fans screaming, roughly 5 tokens per second. Honestly, the experience was pretty discouraging. I wanted to use it for coding assistance, but waiting for a response took longer than Googling the answer myself.

Later, I ran the same model with the same parameters on my desktop—this time with an NVIDIA GPU and CUDA drivers installed—3 seconds.

No exaggeration, just from nearly a minute down to a few seconds. That “I asked, now I want the answer” feeling finally came back.

This article will help you set up Ollama GPU acceleration. Whether you’re on NVIDIA, AMD, or Apple Silicon, I’ll walk you through configuration, verification, and troubleshooting. Save those waiting hours for something more interesting.

How Good Is GPU Acceleration: The Real Gap from Nearly a Minute to 3 Seconds

Let’s cut to the chase: running local LLMs on GPU delivers 10-20x speed improvement. This isn’t marketing hype—it’s my real-world testing data and the consensus across the community.

You might ask: can’t CPU run these models? Why bother with GPU setup?

It can run, but the experience is completely different. A CPU running a 7B model manages 3-8 tokens per second, meaning a 500-word response takes 20-60 seconds. Switch to GPU, and the same model hits 40-80 tokens per second, done in a few seconds. This gap isn’t just “faster”—it’s the difference between “barely usable” and “actually usable.”
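To make that arithmetic concrete, here's a back-of-the-envelope sketch (taking a ~300-token response, with 5 tok/s as a representative CPU rate and 60 tok/s for GPU—illustrative numbers, not benchmarks):

```shell
# Time to generate a ~300-token response at CPU-class vs GPU-class rates
tokens=300
for rate in 5 60; do
  awk -v t="$tokens" -v r="$rate" \
    'BEGIN { printf "%2d tok/s -> %.1f seconds\n", r, t / r }'
done
# ->  5 tok/s -> 60.0 seconds
# -> 60 tok/s -> 5.0 seconds
```

A full minute versus five seconds for the same answer—that's the "barely usable" versus "actually usable" line.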

Does Your GPU Support It?

Ollama supports three GPU platforms, each with its own requirements:

NVIDIA GPUs: The most mainstream and hassle-free option. Official requirement is Compute Capability 5.0 or higher, which basically means GTX 900 series and later. GTX 1060, RTX 3060, RTX 4090—all good. I have an RTX 3060 12GB that handles models under 14B parameters without breaking a sweat.

AMD GPUs: Slightly more involved setup, but they run just as well. Linux requires ROCm v7; Windows currently only has a ROCm v6.1 preview. Supported GPU models are also limited—RX 6000 and RX 7000 series are the safest bets, while older cards need some extra configuration.

Apple Silicon: M1/M2/M3/M4 are all supported, and it's automatic. Mac users basically don't need any configuration—install Ollama and Metal acceleration kicks in. As of 2026, there's also the MLX backend option, pushing performance even higher.

Is Your VRAM Enough?

This is something many people overlook. GPU model inference has one hard requirement: VRAM.

Let’s do some quick math: a 7B model with 4-bit quantization needs roughly 5-6GB VRAM, 14B needs 10-12GB, and 70B requires 40GB+. Your GPU’s VRAM directly determines what model size you can run. My RTX 3060 12GB runs Llama 3 8B comfortably, but Mixtral 8x7B is tight—I have to offload some layers to CPU.
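That quick math can be scripted as a rough estimator. This is a sketch using a common rule of thumb—weights take about params × (bits / 8) GB, times roughly 1.5 for KV cache and runtime overhead—not an exact formula:

```shell
# Rough VRAM estimator for 4-bit quantized models
bits=4
for params in 7 14 70; do
  awk -v p="$params" -v b="$bits" 'BEGIN {
    weights = p * b / 8          # weight storage in GB
    printf "%2dB model: ~%.1f GB VRAM\n", p, weights * 1.5
  }'
done
```

The results line up with the numbers above—roughly 5-6GB for 7B, 10-12GB for 14B, and 40GB+ for 70B. Real usage also depends on context length, so treat these as a floor.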

So before configuring GPU, know your card’s model and VRAM capacity. It sets your expectations.

NVIDIA CUDA: The Most Hassle-Free Solution (With a Few Caveats)

If you’re using NVIDIA, congrats—your setup might be the simplest of the three platforms.

First, Check If Drivers Are Installed

Open your terminal and run:

nvidia-smi

If you see a table with GPU model, VRAM size, and driver version, your drivers are good to go. Ollama will automatically detect and use CUDA—no need to install CUDA Toolkit separately. Yep, you read that right. Ollama bundles the necessary CUDA libraries, saving you a step.

If the command isn’t found, you’ll need to install drivers first. Ubuntu users can run:

sudo apt install nvidia-driver-535  # or newer version

Reboot after installation, then verify with nvidia-smi again.

What About Multiple GPUs?

If your machine has multiple GPUs (say, two RTX 3090s), Ollama by default spreads the model across all cards. But sometimes you want to specify which card to use—maybe one card runs the model while another handles something else.

Set an environment variable:

# Use only GPU 0
export CUDA_VISIBLE_DEVICES=0

# Use GPU 0 and GPU 2
export CUDA_VISIBLE_DEVICES=0,2

Add this line to ~/.bashrc or your systemd service config for persistence.
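For the systemd case: if Ollama runs as a service (the default with the official Linux installer), a plain export in your interactive shell won't reach it. A sketch of the override approach:

```shell
# Open an override file for the service (creates
# /etc/systemd/system/ollama.service.d/override.conf)
sudo systemctl edit ollama.service

# In the editor, add:
#   [Service]
#   Environment="CUDA_VISIBLE_DEVICES=0"

# Then apply the change:
sudo systemctl daemon-reload
sudo systemctl restart ollama
```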

Running Ollama in Docker?

Some folks like putting all services in containers—Ollama works there too. But note: Docker containers can't access the host GPU by default, so extra configuration is needed.

Use NVIDIA’s official nvidia-container-toolkit:

# Install toolkit
sudo apt install nvidia-container-toolkit

# Configure Docker runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Then when starting the Ollama container, add the --gpus all flag:

docker run -d --gpus all -v ollama:/root/.ollama -p 11434:11434 ollama/ollama

How to Confirm GPU Is Actually Being Used?

A simple verification: while running a model, open another terminal and run:

ollama ps

The output shows the currently running model and where it's loaded. If the PROCESSOR column shows something like 100% GPU, acceleration is working.

You can also use nvidia-smi -l 1 for real-time VRAM monitoring—VRAM should spike noticeably when the model runs.
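If you want to script the ollama ps check, here's what parsing its output can look like—the sample line below uses made-up values in the format the command prints (NAME, ID, SIZE, PROCESSOR, UNTIL):

```shell
# Illustrative sample of `ollama ps` output
sample='NAME            ID              SIZE      PROCESSOR    UNTIL
llama3:latest   365c0bd3c000    5.4 GB    100% GPU     4 minutes from now'

echo "$sample" | awk 'NR > 1 {
  if ($0 ~ /100% GPU/)  print $1 ": fully on GPU"
  else if ($0 ~ /GPU/)  print $1 ": partial offload (split CPU/GPU)"
  else                  print $1 ": CPU only"
}'
# -> llama3:latest: fully on GPU
```

A PROCESSOR value like 52%/48% CPU/GPU means the model didn't fully fit in VRAM and some layers spilled to CPU.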

AMD ROCm: Slightly More Setup, Runs Just As Fast

I feel the AMD struggle—most tutorials online are NVIDIA-focused, and AMD documentation is scattered. But good news: Ollama’s ROCm support has stabilized. Just a few more steps.

The Right Path for Linux Users: ROCm v7

If you’re on Ubuntu 22.04 or newer, ROCm installation isn’t too painful:

# Add AMD official repository
sudo apt update
sudo apt install amdgpu-install
sudo amdgpu-install --usecase=rocm

# Add yourself to render group
sudo usermod -aG render,video $USER

# Reboot to apply
sudo reboot

After reboot, verify with the rocminfo command. If you see your GPU info, you're ready to install Ollama. The ROCm build of Ollama auto-detects the environment—no extra config needed.

Windows Users: Still Preview Territory

Honestly, Windows ROCm support isn’t mature yet. Official ROCm v6.1 preview exists but supports limited GPU models and isn’t as stable as Linux. If you mainly work on Windows with an AMD card, my suggestion:

Prioritize WSL2 + Ubuntu. Running Ollama in Linux subsystem delivers much better performance and stability than native Windows.

Older GPUs? HSA_OVERRIDE to the Rescue

AMD GPU architecture codenames get complicated. ROCm officially supports newer architectures (gfx900, gfx1030, etc.). If your card is older, like RX 580 (gfx803), ROCm won’t recognize it by default.

Use an environment variable to force override:

export HSA_OVERRIDE_GFX_VERSION=10.3.0  # Force gfx1030 compatibility mode

This doesn’t work in all cases, but community feedback shows it helps many older cards. Give it a try—if it fails, you fall back to CPU inference.

Multi-GPU Configuration

Similar to NVIDIA, AMD also supports specifying which GPUs to use:

export ROCR_VISIBLE_DEVICES=0,1  # Use GPU 0 and GPU 1

Check GPU numbers with rocm-smi command.

Common AMD GPU Architecture Codenames

Here are common models and their architectures for troubleshooting:

GPU Model     | Architecture | ROCm Support
RX 7900 XTX   | gfx1100      | Native
RX 6800 XT    | gfx1030      | Native
RX 5700 XT    | gfx1010      | Native
RX 580        | gfx803       | Needs HSA_OVERRIDE
Vega 56/64    | gfx900       | Native

Overall, AMD setup has a few more pitfalls than NVIDIA, but once it works, performance is comparable.
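If you script your setup, the table above can double as a tiny lookup helper. A sketch that only covers the models listed here (architecture names as reported by rocminfo):

```shell
# Map a rocminfo architecture name to its ROCm support status
rocm_support() {
  case "$1" in
    gfx1100|gfx1030|gfx1010|gfx900) echo "native" ;;
    gfx803)                         echo "needs HSA_OVERRIDE" ;;
    *)                              echo "unknown - check AMD docs" ;;
  esac
}

rocm_support gfx1030   # -> native
rocm_support gfx803    # -> needs HSA_OVERRIDE
```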

Apple Metal: Hidden Bonus for Mac Users

If you’re using Apple Silicon Mac (M1/M2/M3/M4), here’s the good news: you don’t need to configure anything.

Really, nothing at all. Install Ollama, run a model, GPU acceleration activates automatically. Apple’s Metal framework is built into Ollama—the system automatically loads models onto GPU.

M-Series Chip Performance

Based on community testing, Mac local LLM performance is actually quite decent:

  • M1/M2 8GB: Running 7B models, around 15-20 tok/s
  • M2 Pro 16GB: Running 14B models, hits 25-30 tok/s
  • M3 Max 36GB: Running 30B+ models, maintains 30+ tok/s

Compared to older CPU-only machines, this speed is practical. Not quite RTX 4090 “instant response” level, but perfectly fine for coding assistance, translation, and text polishing.

2026 Bonus: MLX Backend

If you’re on M-series chips with 32GB+ unified memory (like M3 Max, M4 Pro), you can also enable the MLX backend—Apple’s machine learning framework optimized for their silicon.

According to developer community data, MLX backend boosts inference speed by 93%. What does that mean? Running Llama 3 8B at 57.8 tokens per second becomes 111.4 tok/s with MLX. That’s the difference from “pretty smooth” to “actually fast.”

Enabling it is simple—just add a parameter:

ollama run llama3 --backend mlx

Note that MLX currently has high memory requirements—below 32GB it might be unstable. Also, only some models support MLX, mainly those published in mlx format.

How to Confirm GPU Is Working?

Open Activity Monitor, switch to GPU History tab. When running a model, you should see GPU usage spike. If only CPU moves and GPU stays flat, Metal might not be enabled—rare, but reinstalling Ollama usually fixes it.

GPU Detection Failed: Common Troubleshooting

Hardware configuration issues are almost inevitable. Here are problems I’ve encountered and their solutions—hopefully saves you some detours.

Problem 1: no compatible GPUs were discovered

Most common error—Ollama can’t find a usable GPU.

Possible causes:

  • Drivers not installed or too old
  • GPU model unsupported (like GTX 700 series)
  • Docker container lacks GPU access permissions

Troubleshooting steps:

# NVIDIA: verify driver
nvidia-smi

# AMD: verify ROCm
rocminfo

# If commands fail, install drivers first

Problem 2: Not compiled with GPU offload support

This error means your downloaded Ollama version lacks GPU support.

Solution: Re-download the correct version from the official site. AMD users note: Ollama has a dedicated ROCm version with a different download link than CUDA. Don’t download the wrong one.

Problem 3: NVIDIA Driver Version Too Old

Ollama requires NVIDIA driver 450 or higher. If your system runs an old 400-series driver, CUDA won’t work.

# Check current driver version
nvidia-smi | grep "Driver Version"

# If too old, update driver
sudo apt install nvidia-driver-535

Problem 4: AMD amdgpu Driver Missing

Linux AMD GPUs need amdgpu driver for ROCm. Some systems default to older radeon driver, which doesn’t support ROCm.

# Check currently loaded driver
lsmod | grep amdgpu

# If no output, install manually
sudo apt install amdgpu-dkms

Problem 5: SELinux Blocking Container GPU Access

Ran into this on CentOS/RHEL systems. SELinux default policy blocks container access to GPU devices.

Quick fix:

sudo setenforce 0  # Temporarily disable SELinux

Permanent fix requires adjusting SELinux policy—complex, check Red Hat official docs. Or just switch to Ubuntu, simpler.

Verification Commands Summary

Here’s a quick command checklist for troubleshooting:

# 1. Check if GPU recognized by system
nvidia-smi       # NVIDIA
rocminfo         # AMD
system_profiler SPDisplaysDataType  # macOS

# 2. Check Ollama process status
ollama ps

# 3. Real-time GPU monitoring (while running model)
watch -n 1 nvidia-smi    # NVIDIA
rocm-smi -a              # AMD

# 4. Check environment variables
echo $CUDA_VISIBLE_DEVICES
echo $ROCR_VISIBLE_DEVICES

Most issues can be pinpointed with these commands. If still stuck, search Ollama GitHub Issues—plenty of people have been down these rabbit holes.
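The checklist above can be rolled into one small detection script—a sketch that just reports which GPU stack, if any, the system exposes:

```shell
#!/bin/sh
# Report which GPU stack this machine exposes, in the order
# Ollama cares about: NVIDIA, then AMD, then Apple Metal.
if command -v nvidia-smi >/dev/null 2>&1; then
  echo "nvidia: $(nvidia-smi --query-gpu=name --format=csv,noheader)"
elif command -v rocminfo >/dev/null 2>&1; then
  echo "amd: rocminfo found - check its output for your gfx version"
elif [ "$(uname)" = "Darwin" ]; then
  echo "apple: Metal is used automatically on Apple Silicon"
else
  echo "none: no GPU tooling detected - install drivers first"
fi
```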

Final Thoughts

After all this, here’s a quick reference table:

Platform      | Prerequisites     | Setup Difficulty             | Recommendation
NVIDIA        | Driver 450+       | Easy (basically zero config) | First choice
AMD (Linux)   | ROCm v7           | Medium (a few commands)      | Second choice
AMD (Windows) | ROCm v6.1 Preview | Harder (suggest WSL2)        | Average
Apple Silicon | None required     | Simplest                     | First choice for Mac

GPU acceleration—configure once, benefit long-term. The gap between CPU “barely runs” and GPU “actually works” is massive. What platform is your GPU on? Any issues during setup? Share in the comments, I’ll reply when I can.

Ollama GPU Acceleration Setup

Complete GPU acceleration configuration for three platforms

⏱️ Estimated time: 30 min

  1. Step 1: Verify GPU Model and Drivers

    Choose verification method based on your GPU type:

    • NVIDIA: Run nvidia-smi to view GPU info
    • AMD: Run rocminfo to confirm ROCm detection
    • macOS: No verification needed, Metal auto-enables
  2. Step 2: NVIDIA CUDA Configuration

    Simplest solution, just install drivers:

    1. Install driver: sudo apt install nvidia-driver-535
    2. Reboot system
    3. Verify: nvidia-smi should show GPU info
    4. Ollama auto-detects CUDA, no extra config needed
  3. Step 3: AMD ROCm Configuration (Linux)

    Requires ROCm v7 installation:

    1. Install: sudo apt install amdgpu-install
    2. Configure: sudo amdgpu-install --usecase=rocm
    3. Permissions: sudo usermod -aG render,video $USER
    4. Reboot and verify: rocminfo
    5. Older GPUs may need: export HSA_OVERRIDE_GFX_VERSION=10.3.0
  4. Step 4: Verify GPU Acceleration Is Active

    Confirm GPU is working while running model:

    • Run ollama ps to check GPU usage
    • NVIDIA: nvidia-smi -l 1 for real-time monitoring
    • AMD: rocm-smi -a for real-time monitoring
    • macOS: Activity Monitor GPU History
  5. Step 5: Multi-GPU Environment Setup

    Specify which GPUs to use:

    • NVIDIA: export CUDA_VISIBLE_DEVICES=0,2
    • AMD: export ROCR_VISIBLE_DEVICES=0,1
    • Add environment variable to ~/.bashrc for persistence

FAQ

What GPU platforms does Ollama support?
Ollama supports three platforms: NVIDIA (CUDA, Compute Capability 5.0+), AMD (ROCm v7 Linux / v6.1 Windows Preview), and Apple Silicon (Metal auto-enabled). NVIDIA is easiest to configure; Apple Silicon Macs require zero configuration.
What if my GPU VRAM isn't enough?
VRAM determines model size: 7B models need 5-6GB, 14B needs 10-12GB, 70B needs 40GB+. With insufficient VRAM: 1. Use a smaller or more aggressively quantized model; 2. Let Ollama automatically offload some layers to CPU and system RAM (slower); 3. Use multiple GPUs to distribute the load.
How do I confirm GPU acceleration is working?
Run ollama ps command while model is running to check GPU usage. NVIDIA users can use nvidia-smi -l 1 for real-time VRAM monitoring. If GPU usage rises, acceleration is active.
Can older AMD GPUs like RX 580 work?
Some older GPUs can be force-enabled with environment variable. Set export HSA_OVERRIDE_GFX_VERSION=10.3.0 to force gfx1030 compatibility mode. This doesn't work in all cases, requires testing.
How to use GPU in Docker containers?
NVIDIA requires nvidia-container-toolkit installation, then configuring the Docker runtime. Add the --gpus all flag when starting the container. AMD container GPU configuration is more complex; I suggest running Ollama directly on the host.
How to enable Apple Silicon MLX backend?
Add --backend mlx parameter when running model, e.g., ollama run llama3 --backend mlx. MLX requires 32GB+ unified memory and only some models support it. Performance improvement is about 93%.
What if I get 'no compatible GPUs were discovered' error?
Troubleshoot in order: 1. Check if drivers installed (nvidia-smi or rocminfo); 2. Check if driver version too old (NVIDIA needs 450+); 3. Check if Docker container configured GPU access; 4. Check if AMD missing amdgpu driver.

10 min read · Published on: Apr 25, 2026 · Modified on: Apr 25, 2026
