Ollama GPU Acceleration Configuration: CUDA, ROCm, and Metal Platform Guide

When I first ran a 7B model locally, I used pure CPU. The experience? Less than two tokens per second - I could finish half a cup of coffee waiting for it to complete a sentence. Later, I got an RTX 3080, and with the same model and parameters, the speed jumped to over 40 tokens per second - a speedup of well over 20x.

That’s not all. Larger models, longer contexts, multi-turn conversations - CPU basically can’t handle these. GPU acceleration isn’t just nice to have, it’s the difference between usable and unusable.

If your computer has a graphics card - whether NVIDIA, AMD, or Apple Silicon - there’s a good chance it can accelerate Ollama. But how do you configure it? Each platform has its own pitfalls. NVIDIA users have it easy - just install drivers. AMD users need to deal with ROCm on Linux, or fall back to Vulkan on Windows. Mac users have it easiest - there’s nothing to configure.

This article will cover configuration methods, common pitfalls, and troubleshooting approaches for all three platforms in one go.

Why GPU Acceleration Matters

Let’s start with data. Based on testing, the inference speed difference for 7B models across different hardware is substantial:

Acceleration Method	Typical Performance (7B Model)	Use Case
CPU-only inference	0.5-2 tokens/sec	Testing, debugging
NVIDIA CUDA	30-80 tokens/sec	Daily use, production
Apple Metal	20-50 tokens/sec	Mac users
AMD ROCm	25-60 tokens/sec	Linux AMD users

Why such a huge gap? Simply put, GPUs excel at “repetitive work.” Large model inference is essentially matrix multiplication - billions of multiply-add operations for every generated token. A CPU doing this is like having a PhD student work through math problems one by one - accurate but slow. A GPU? Thousands of workers doing it together, each handling a small piece. Individually they’re not as smart, but there’s strength in numbers.

Then there’s memory bandwidth. How fast inference runs largely depends on how quickly the weights can be fed to the compute units. GPU memory bandwidth is typically an order of magnitude higher than CPU memory bandwidth - the RTX 3080 delivers 760 GB/s (912 GB/s on the 12GB variant), while typical dual-channel DDR4 manages only around 50 GB/s. If the data is stuck in traffic, fast computation is useless.
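
To see why bandwidth dominates, a rough rule of thumb helps: generating one token means reading every model weight once, so bandwidth divided by model size gives a ceiling on tokens per second. The sketch below is a back-of-the-envelope estimate, not a benchmark. The helper name max_tokens_per_sec is made up for illustration, and the 4 GB model size and bandwidth figures are assumed example inputs.

```shell
# Upper bound on decode speed for a memory-bound model:
#   tokens/sec <= memory bandwidth (GB/s) / model size in memory (GB)
# Hypothetical helper; ignores compute limits, KV cache reads, and overhead.
max_tokens_per_sec() {
  awk -v bw="$1" -v size="$2" 'BEGIN { printf "%.1f\n", bw / size }'
}

# A 4 GB quantized 7B model on ~50 GB/s system RAM versus a 760 GB/s GPU
max_tokens_per_sec 50 4     # prints 12.5
max_tokens_per_sec 760 4    # prints 190.0
```

Real-world speeds land well below these ceilings, but the ratio between the two explains most of the gap in the table above.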

So when do you need GPU? Basically, running models larger than 7B requires it. Chat, coding, long text generation - without GPU, the experience will be terrible. If you’re just occasionally playing around or debugging a small model, CPU might suffice.

NVIDIA CUDA Configuration Guide

NVIDIA is the most hassle-free choice. Mature ecosystem, comprehensive documentation, abundant community experience - people have already stepped on all the pitfalls for you.

Hardware and Driver Requirements

Not all NVIDIA graphics cards work. Ollama requires Compute Capability 5.0 or higher. What does that mean? Check this table:

Compute Capability	Representative Cards	Works?
8.9	RTX 4090/4080/4070	Perfect
8.6	RTX 3090/3080/3070	Perfect
7.5	RTX 2080 Ti/2080	Perfect
6.1	GTX 1080 Ti/1080	Works
5.2	GTX 980 Ti/980	Works
5.0	GTX 750 Ti/750	Works
Below 5.0	Kepler-based GTX 7xx/6xx and older	Not supported

The driver version matters too. The requirement is 531 or newer on Windows and 535 or newer on Linux. With an older driver, CUDA won’t run.

Verification and Installation Steps

First, confirm your graphics card is recognized by the system. Run this in terminal:

nvidia-smi

If you can see graphics card information, driver version, and CUDA version, you’re good. If it says “command not found”, the driver isn’t installed or the path is wrong.
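
If you want to script the version check, the sketch below compares a driver version string against a minimum major version. The helper name driver_ok is hypothetical, and the example versions are hard-coded so the snippet runs without a GPU.

```shell
# Succeeds if the driver's major version meets the minimum.
# Hypothetical helper: $1 = driver version (e.g. 550.54.14), $2 = minimum major version.
driver_ok() {
  major=${1%%.*}          # keep everything before the first dot
  [ "$major" -ge "$2" ]
}

# Examples with hard-coded versions (no GPU needed)
driver_ok 550.54.14 535 && echo "driver new enough"
driver_ok 470.82.01 535 || echo "driver too old"
```

On a real machine, the installed version can be fed in with `driver_ok "$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)" 535`.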

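Ollama automatically detects CUDA after installation. No extra configuration is needed; just make sure the driver is working. To test, start a model, then check its status from a second terminal (or after leaving the session with /bye) while the model is still loaded:

ollama run llama3.2
# In a second terminal:
ollama ps

You should see GPU information in the ollama ps output, something like this (values are illustrative):

NAME               ID        SIZE      PROCESSOR    UNTIL
llama3.2:latest    abc123    4.7 GB    100% GPU     2 minutes from now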

If it shows CPU instead of GPU, there’s a problem.
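
For scripted checks, the PROCESSOR column can be classified with a few lines of awk. This is a sketch under assumptions: the helper name classify_processor is made up, and it assumes the column contains strings such as "100% GPU", "100% CPU", or a split like "48%/52% CPU/GPU". It only looks at the first loaded model.

```shell
# Classify the first loaded model in `ollama ps` output (read from stdin).
# Prints gpu, partial, cpu, or none. Hypothetical helper; assumes the
# PROCESSOR column looks like "100% GPU", "100% CPU", or "48%/52% CPU/GPU".
classify_processor() {
  awk 'NR == 1      { next }                  # skip the header row
       /CPU\/GPU/   { r = "partial"; exit }   # split between CPU and GPU
       /% GPU/      { r = "gpu"; exit }
       /% CPU/      { r = "cpu"; exit }
       END          { print (r == "" ? "none" : r) }'
}
```

Usage: `ollama ps | classify_processor`. A result of partial means some layers spilled to the CPU, which usually points to insufficient VRAM rather than a driver problem.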

Common Pitfalls

Wrong driver version. Download the latest driver from NVIDIA’s website. Linux users should be careful not to install the wrong version - some distribution default drivers are too old.

Missing CUDA Toolkit. Actually, Ollama doesn’t need the full CUDA Toolkit - it comes with a stripped-down version. But some system configurations are special and might need manual CUDA runtime installation. On Linux:

# Ubuntu/Debian
sudo apt install nvidia-cuda-toolkit

Running Ollama in containers. Docker needs the NVIDIA Container Toolkit installed on the host, plus the --gpus flag so the container can access the GPU:

docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

AMD ROCm Configuration Guide

AMD users have more work to do. ROCm (AMD’s CUDA alternative) isn’t as mature as CUDA, but has improved significantly in the past two years. Linux configuration is relatively smooth, but Windows requires some workarounds.

Which AMD Cards Work?

ROCm support is best on recent RDNA architectures:

Architecture	Series	Support Level
RDNA3	RX 7900 XTX/XT, RX 7800/7700	Best
RDNA2	RX 6800/6700/6600	Good
RDNA1	RX 5700/5600/5500	Usable
GCN	RX Vega, RX 500/400	Not officially guaranteed

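Basically, RX 7000 and 6000 series are fine, 5000 series work okay, and older cards shouldn’t be relied upon. Support also varies by individual card and ROCm release, so check Ollama’s supported-GPU list for your exact model.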

Linux ROCm Installation

Ubuntu/Debian users follow these steps:

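# Refresh the package index (AMD's ROCm apt repository must already be configured)
sudo apt update

# Install ROCm core
sudo apt install rocm-dkms rocm-dev rocm-libs

# Install HIP runtime
sudo apt install hip-runtime-amd

# Let your user access the GPU device nodes
sudo usermod -aG render,video $USER

# Reboot so the kernel module and group changes take effect, then verify
rocminfo

If rocminfo lists your graphics card, you’re set. Package names vary between ROCm releases, so check AMD’s install guide if apt can’t find one of them.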

Ollama automatically detects ROCm after installation. Like CUDA, no extra configuration needed.
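
When a card is not picked up, it helps to know which architecture target ROCm reports for it. The sketch below pulls the gfx identifiers out of rocminfo output; the helper name gfx_targets is hypothetical, and it simply pattern-matches text rather than using any ROCm API.

```shell
# List the unique GPU architecture targets (gfx IDs) found in rocminfo output.
# Hypothetical helper; reads rocminfo text from stdin.
gfx_targets() {
  grep -o 'gfx[0-9a-f]*' | sort -u
}
```

Usage: `rocminfo | gfx_targets`. An RX 7900 XTX, for example, reports gfx1100; the identifier is what support lists and bug reports usually refer to.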

What About Windows Users?

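ROCm’s Windows support is still maturing. But there’s an alternative - Vulkan. The environment variable has to reach the Ollama server process, not just the client, so quit the Ollama tray app, start the server from a shell where the variable is set, and run the model from a second terminal:

# Windows PowerShell, terminal 1
$env:OLLAMA_VULKAN = "1"
ollama serve

# Terminal 2
ollama run llama3.2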

Vulkan performance isn’t as good as ROCm, but it works. Real-world testing shows about 70-80% of ROCm speed.

Multi-GPU Selection

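If you have multiple AMD GPUs, you can specify which ones to use. Set the variable in the environment of the Ollama server process, not just in the shell where you run the client: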

# Use only first GPU
export ROCR_VISIBLE_DEVICES=0

# Use first and third GPUs
export ROCR_VISIBLE_DEVICES=0,2
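
On Linux installs where Ollama runs as a systemd service, a plain export in your shell never reaches the server. One way around that is a systemd drop-in. The sketch below only prints the drop-in text so you can inspect it first; the helper name service_env_dropin and the file name gpu.conf used in the note afterwards are arbitrary choices, not anything Ollama prescribes.

```shell
# Print a systemd drop-in that sets one environment variable for a service.
# Hypothetical helper: $1 = variable name, $2 = value.
service_env_dropin() {
  printf '[Service]\nEnvironment="%s=%s"\n' "$1" "$2"
}

# Preview the drop-in for pinning Ollama to the first AMD GPU
service_env_dropin ROCR_VISIBLE_DEVICES 0
```

To apply it, create the directory with `sudo mkdir -p /etc/systemd/system/ollama.service.d`, pipe the output into `sudo tee /etc/systemd/system/ollama.service.d/gpu.conf`, then run `sudo systemctl daemon-reload` and `sudo systemctl restart ollama`. The same approach works for other server-side variables.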

Performance Comparison

Both AMD’s official figures and community testing provide some data. The RX 7900 XTX (AMD flagship) runs 7B models at about 35-45 tokens/sec, while the RTX 4090 (NVIDIA flagship) reaches 50-70 tokens/sec. There’s a gap, but the price difference is even larger - the 7900 XTX is about 40% cheaper.

From a price-performance perspective, AMD users should take the time to set up ROCm.

Apple Metal Zero-Configuration Experience

Mac users have it easiest. Ollama’s support for Apple Silicon is zero-configuration - install Ollama, run it, GPU acceleration automatically kicks in.

Which Macs Work?

All Apple Silicon Macs are supported:

M1 / M1 Pro / M1 Max / M1 Ultra
M2 / M2 Pro / M2 Max / M2 Ultra
M3 / M3 Pro / M3 Max / M3 Ultra
M4 series

On Intel Macs, Ollama doesn’t use GPU acceleration and runs on the CPU only. But Intel Macs are about ready for retirement anyway.

Automatic Detection Mechanism

Ollama automatically detects Metal at startup. No configuration files, environment variables, or driver installations needed - Apple has deeply integrated Metal into the system.

Verify it:

ollama run llama3.2
ollama ps

The output should show GPU, like:

PROCESSOR: 100% GPU

If you see CPU, there’s a problem. But honestly, this is rare on Mac.

What’s the Performance Like?

Base M2 runs 7B models at about 25-35 tokens/sec. Pro/Max versions are faster because they have more GPU cores. Testing shows M2 Max can reach around 45 tokens/sec, comparable to mid-range NVIDIA cards.

One detail: Apple Silicon uses a unified memory architecture - GPU and CPU share system memory. The benefit is that there’s no separate, smaller VRAM pool, so the GPU can use a large share of system RAM; the downside is that large models compete with everything else for that memory. An 8GB M2 runs 7B models okay, 14B is pushing it, and 70B is out of the question.

Common Misconceptions

Many people think Mac needs Metal configuration - it doesn’t at all. Ollama’s official code already has Metal detection logic, automatically enabled after installation.

Others ask about installing ROCm or CUDA - Mac doesn’t use these at all. Metal is Apple’s own technology, built into the system.

Multi-GPU and VRAM Management

If you have multiple GPUs, or insufficient VRAM, this section is crucial.

Layer Distribution Mechanism

Large models don’t always run entirely on the GPU. A model is made up of many “layers”, and when VRAM is tight some of them run on the GPU while the rest run on the CPU. The split is calculated dynamically: Ollama decides how many layers go on the GPU based on available VRAM.

For example: a Llama-style 7B model has 32 transformer layers. If only about 3GB of VRAM is free for a 4GB quantized model, perhaps 24 layers land on the GPU and 8 on the CPU. The less VRAM you have, the more layers spill over to system memory, and the slower inference gets.
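
The arithmetic behind such a split can be sketched in a few lines. This is a deliberately simplified model, not Ollama’s actual scheduler: it assumes all layers are the same size and ignores the KV cache and runtime overhead. The helper name layers_on_gpu and the example numbers are illustrative.

```shell
# Estimate how many layers fit on the GPU, assuming equal-sized layers.
# Hypothetical helper: $1 = total layers, $2 = model size (GB), $3 = free VRAM (GB).
# Ignores the KV cache and runtime overhead, so real numbers come out lower.
layers_on_gpu() {
  awk -v n="$1" -v size="$2" -v vram="$3" 'BEGIN {
    fit = int(vram * n / size)   # whole layers that fit in free VRAM
    if (fit > n) fit = n         # cannot offload more layers than exist
    print fit
  }'
}

layers_on_gpu 32 4 3    # prints 24
layers_on_gpu 32 4 8    # prints 32
```

With 8GB free, the whole 32-layer model fits, which is why a quantized 7B model normally shows 100% GPU on an 8GB card.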

Pack vs Spread Mode

Multi-GPU environments have two strategies:

Pack Mode (default): Try to fit the model into one GPU, overflow to another. Good when GPU performance differs significantly.
Spread Mode: Distribute evenly across all GPUs. Good when GPU performance is similar.

Enable Spread mode:

export OLLAMA_SCHED_SPREAD=1

Honestly, most people can use the default Pack mode. Spread mainly has advantages in VRAM utilization but is more complex to configure and requires experience to tune.

What If VRAM Is Insufficient?

Running large models is most problematic when VRAM isn’t enough. Several solutions:

1. Use quantized models. Q4_K_M quantization shrinks a 7B model from about 14GB (FP16) to about 4GB, at the cost of only about 5-10% in output quality. Very worthwhile.

# Pull a quantized version (tag names differ per model; check the model's Tags page)
ollama pull llama3.2:3b-instruct-q4_K_M

2. Reduce context length. Long conversations, large documents occupy lots of VRAM. If it’s just simple Q&A, shorter context is fine.

3. Multi-GPU distribution. Two 8GB cards can hold a model that won’t fit on one. Layers are processed one card after another, though, so expect more capacity rather than double the speed.
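
The 14GB and 4GB figures for quantization follow from simple arithmetic: weight size is roughly parameter count times bits per weight, divided by eight. The sketch below works that out. The helper name model_size_gb is made up, and the 4.8 bits per weight used for Q4_K_M is an approximation that varies by model.

```shell
# Approximate weight size in GB: parameters (billions) * bits per weight / 8.
# Hypothetical helper; excludes the KV cache and runtime buffers.
model_size_gb() {
  awk -v p="$1" -v bits="$2" 'BEGIN { printf "%.1f\n", p * bits / 8 }'
}

model_size_gb 7 16     # FP16: prints 14.0
model_size_gb 7 4.8    # ~4.8 bits/weight assumed for Q4_K_M: prints 4.2
```

Add some headroom on top of the result for the context (KV cache), which grows with context length.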

Dynamic Allocation Logic

Ollama manages this automatically, so there’s no need to specify the layer count by hand. If you do want to force it, set the num_gpu parameter (the number of layers to offload to the GPU) in a Modelfile or API request. This is advanced usage that most people don’t need.
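
As a sketch of what a manual override can look like, the helper below builds a request body for Ollama’s /api/generate endpoint with two options set: num_ctx (context length in tokens) and num_gpu (layers to offload). The helper name generate_body, the prompt, and the values 2048 and 24 are placeholders for illustration, not recommendations.

```shell
# Build a JSON body for Ollama's /api/generate with explicit options.
# Hypothetical helper: $1 = model, $2 = num_ctx (context tokens), $3 = num_gpu (layers to offload).
generate_body() {
  printf '{"model":"%s","prompt":"Hello","stream":false,"options":{"num_ctx":%d,"num_gpu":%d}}\n' "$1" "$2" "$3"
}

# Preview the request body
generate_body llama3.2 2048 24
```

With a server running on the default port, send it with `curl http://localhost:11434/api/generate -d "$(generate_body llama3.2 2048 24)"`. A smaller num_ctx shrinks the VRAM used by the context, which is often enough to avoid touching num_gpu at all.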

Troubleshooting Guide

You’ll likely run into issues when configuring GPU acceleration. Here’s a compilation of common troubleshooting approaches.

GPU Detection Issue Checklist

Check in order:

Confirm driver installation
```
# NVIDIA
nvidia-smi

# AMD
rocminfo
```
If there’s an error, install drivers first.

Confirm Ollama version

ollama --version

Very old versions might not support certain GPUs. Update:

# Linux/macOS
curl -fsSL https://ollama.com/install.sh | sh

# Windows
# Download latest installer from official website

Check CUDA/ROCm version
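```
# CUDA version supported by the driver (shown in the header)
nvidia-smi

# CUDA Toolkit version (only present if the toolkit is installed)
nvcc --version

# ROCm: confirm the stack can see the card
rocm-smi
```
Ollama requires CUDA 12.3+ or ROCm 6.0+. Upgrade if yours is older. Note that nvcc only exists when the CUDA Toolkit is installed; since Ollama bundles its own CUDA runtime, the version that matters is the one the driver supports, shown in the nvidia-smi header.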

Restart service

# Linux
sudo systemctl restart ollama

# macOS/Windows
# Quit Ollama from the menu bar or system tray, then start it again

Some configuration changes need restart to take effect.

GPU Disappears After Sleep

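Waking from sleep can break GPU acceleration: Ollama keeps running but falls back to CPU. Mac and Windows are the usual victims.

Solutions:

Mac: Restart Ollama, or restart the computer
Windows: Check that the driver is working normally, and restart Ollama or reload the driver if necessary
Linux: Less common, but after suspend/resume the NVIDIA driver sometimes needs a nudge; reloading the UVM kernel module (sudo rmmod nvidia_uvm && sudo modprobe nvidia_uvm) and restarting Ollama usually fixes it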

Container GPU Permission Issues

Linux users running Ollama in Docker might find SELinux blocking the container’s access to GPU devices.

Solution:

# Allow containers to use host devices (preferred)
sudo setsebool container_use_devices=1

# Or temporarily switch SELinux to permissive mode to confirm it is the cause (not recommended for long-term use)
sudo setenforce 0

Other Common Issues

“out of memory” error: Model is too large, not enough VRAM. Use quantized version or switch to smaller model.

Inference speed didn’t improve: Confirm ollama ps shows GPU. If it shows CPU, troubleshoot the issues above.

AMD GPU not working: First confirm ROCm is installed correctly. Windows users try Vulkan mode.

Summary

After all this, how to choose?

Your Hardware	Recommended Solution	Configuration Difficulty
NVIDIA GPU	CUDA auto-enable	Low, just install drivers
AMD GPU + Linux	ROCm	Medium, requires manual installation
AMD GPU + Windows	Vulkan	Low, set environment variable
Apple Silicon	Metal auto-enable	Very low, zero configuration
Intel Mac or no GPU	Pure CPU	No configuration needed, but very slow

Simply put: NVIDIA users have it easiest, Mac users are happiest, AMD users on Linux are fine but Windows requires workarounds, and those without GPU… better find a way to get one.

GPU acceleration isn’t optional optimization, it’s a basic requirement for running LLMs locally. Once configured, the experience difference is a qualitative leap.

NVIDIA CUDA GPU Acceleration Configuration

Configure Ollama GPU acceleration on NVIDIA graphics cards for high-speed large model inference

⏱️ Estimated time: 10 min

Step 1: Verify graphics card and driver
Run the `nvidia-smi` command to view graphics card information, driver version, and CUDA version. If there's an error, the driver isn't installed or there's a path configuration issue.
Step 2: Install or update driver
Download the latest driver from NVIDIA's website. Linux users should note that distribution default drivers may be too old. Windows requires driver 531+, Linux requires 535+.
Step 3: Start Ollama and test
Run `ollama run llama3.2` to start the model, then execute `ollama ps` to check processor status. If it shows a GPU percentage, acceleration is working.
Step 4: Troubleshoot issues (if needed)
If it shows CPU, check whether the CUDA runtime is missing (Linux users can install nvidia-cuda-toolkit), make sure Docker containers are started with the --gpus all flag, and try restarting the Ollama service.

FAQ

Does Ollama support AMD graphics cards?

Yes. Linux users can use ROCm, while Windows users need to set the OLLAMA_VULKAN=1 environment variable to enable Vulkan mode. RDNA2 and RDNA3 architectures are best supported.

How can I confirm GPU acceleration is enabled?

After running `ollama run model-name`, execute `ollama ps` and check if the PROCESSOR column shows a GPU percentage. If it shows 100% GPU, acceleration is working.

What if I don't have enough VRAM for large models?

Three solutions: use quantized models (e.g., Q4_K_M reduces 7B model VRAM usage from 14GB to 4GB), reduce context length, or use multi-GPU distribution.

Do Mac users need to configure Metal?

No. Apple Silicon Macs automatically enable Metal acceleration after installing Ollama with zero configuration required. Just ensure you have M1/M2/M3/M4 series - Intel Macs can only use CPU.

What NVIDIA graphics card version is required?

Compute Capability 5.0 or higher (GTX 750 Ti, the GTX 900 series, and newer). Driver version needs to be 531+ on Windows and 535+ on Linux. Ollama automatically detects CUDA after installation.

9 min read · Published on: May 16, 2026 · Modified on: May 17, 2026
