
Ollama GPU Scheduling and Resource Management: VRAM Optimization, Multi-GPU Load Balancing

Ever experienced this? You have an 8GB VRAM GPU, finally manage to load a 13B model, run a few inferences, and suddenly it crashes—staring at a “CUDA out of memory” error, feeling thoroughly frustrated.

Or maybe you invested in two GPUs, excited to finally run large models, only to open nvidia-smi and discover one card is doing all the work while the other sits idle.

Honestly, I encountered these exact issues when I started using Ollama. Insufficient VRAM, unused multi-GPU setups, inconsistent inference speeds—these problems kept me up for several nights. Through trial and error, I gradually understood Ollama’s GPU scheduling logic. Turns out, many things aren’t as simple as “configure it and it works”—you need to understand the principles behind the parameters.

This article consolidates those experiences to help you solve several practical problems:

  • How to run 13B models stably on 8GB VRAM (without sudden OOM crashes)
  • How to configure multi-GPU setups to actually use all cards (complete load balancing solution)
  • Which parameters to adjust when VRAM is insufficient (with priority ranking)
  • What GPU offloading actually is (llama.cpp underlying mechanism)

First, a caveat: this article is fairly technical. You’ll need some understanding of GPU, CUDA, and basic Ollama operations. If you’re new to Ollama, I recommend reading the earlier articles in this series (especially part 6 on performance optimization basics). The context will make this article much easier to follow.

1. GPU Memory Management Mechanism: Complete Parameter Configuration Guide

Ollama’s GPU scheduling centers on a few parameters that control how model layers are distributed between GPU and CPU. Understanding these parameters explains why VRAM errors occur even when you seem to have enough memory, or why inference speeds mysteriously slow down.

1.1 Core Parameters Explained

Let’s start with the most important parameters. I’ve organized them into a table for easy reference:

| Parameter | Function | Default | When to Adjust |
| --- | --- | --- | --- |
| num_gpu | How many model layers run on GPU | Auto-detect | Reduce when VRAM is insufficient |
| main_gpu | Primary GPU index | 0 | Specify which GPU to use in multi-GPU setups |
| low_vram | Low VRAM mode toggle | false | Enable for VRAM under 8GB |
| num_batch | Batch processing size | 512 | Reduce to 256 when VRAM is tight |
| num_ctx | Context length | 4096 | Use 2048 for short conversations to save VRAM |

The num_gpu parameter is the most confusing. It doesn’t mean how many GPUs you have—it means how many model layers to run on the GPU.

For example: Llama 2 7B has 32 layers. If you set num_gpu: 32, all 32 layers run on the GPU. If VRAM is insufficient and you change it to num_gpu: 20, then 20 layers run on the GPU while the remaining 12 must be computed by the CPU—naturally slowing down the speed.

The low_vram parameter is interesting. When enabled, Ollama uses techniques to save VRAM, such as placing KV cache in CPU memory instead of GPU VRAM. The tradeoff is slower inference speed, but at least it won’t crash.
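Put together in a Modelfile, a VRAM-constrained setup might look like this (the model name `llama3-lowvram` and the specific values are illustrative, not tuned for any particular card):

```
# Modelfile — illustrative values for a tight-VRAM setup
FROM llama3
PARAMETER num_gpu 24
PARAMETER low_vram true
PARAMETER num_ctx 2048
PARAMETER num_batch 256
```

Build it with `ollama create llama3-lowvram -f Modelfile`, then run the new tag as usual.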

1.2 VRAM Allocation Process

When Ollama loads a model, VRAM allocation follows this process:

  1. Detect VRAM: Check available GPU VRAM
  2. Calculate layers: Determine how many layers can fit on GPU based on model size and VRAM
  3. Allocate KV cache: Reserve space for inference cache (this also uses VRAM)
  4. Start inference: Dynamic VRAM usage with fluctuations

The key is step two—Ollama automatically calculates the optimal layer distribution. However, sometimes this automatic calculation isn’t accurate enough, especially when VRAM is just barely sufficient (like running a 13B model on 8GB VRAM). In these cases, you need to manually specify num_gpu.

Want to know how many layers are using GPU offloading for the current model? Use this command:

ollama run llama3 --verbose

The output will include a line like llama_model_load: model loaded - layers: 40/40 on GPU, indicating all 40 layers are on the GPU.

1.3 llama.cpp Backend Mechanism

Ollama uses llama.cpp as its inference engine. Understanding llama.cpp’s GPU offloading logic explains why sometimes parameter adjustments have minimal effect.

GPU Offloading Decision

llama.cpp calculates like this:

Available VRAM = Total GPU VRAM - System reserved (~few hundred MB)
Layer size = Model parameters / Number of layers
Layers that fit = min(Total layers, Available VRAM / Layer size)

There’s a pitfall in this calculation: it only considers VRAM used by the model itself, not accounting for KV cache. KV cache is used during inference and grows with conversation length. So sometimes the model loads successfully, but after a few inferences, KV cache explodes the VRAM, causing a crash.
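A rough Python sketch of this estimate, extended to reserve space for the KV cache up front (the reservation sizes are assumptions for illustration, not llama.cpp's exact accounting):

```python
def layers_on_gpu(total_vram_mb, model_size_mb, total_layers,
                  reserved_mb=500, kv_cache_mb=2000):
    """Estimate how many layers fit on the GPU.

    Unlike the naive formula, this reserves space for the KV cache
    up front so inference doesn't overflow VRAM later.
    """
    available = total_vram_mb - reserved_mb - kv_cache_mb
    layer_size = model_size_mb / total_layers
    return max(0, min(total_layers, int(available / layer_size)))

# 8 GB card, 13B Q4 model (~8000 MB, 40 layers): only a partial offload fits
print(layers_on_gpu(8192, 8000, 40))   # → 28

# 24 GB card: the whole model fits
print(layers_on_gpu(24576, 8000, 40))  # → 40
```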

Real-world comparison (FP16 → Q4_K_M): Q4 quantization cuts VRAM usage by roughly 75%.

Hybrid Computing Architecture

GPU and CPU don’t have completely separate tasks. Roughly:

  • GPU handles: Matrix calculations, attention operations (high computational load)
  • CPU handles: Embedding, normalization operations (low computational load)
  • Data transfer: Data moves back and forth between GPU and CPU, incurring overhead

If you only put some layers on the GPU, data transfer overhead becomes significant—after each layer completes, the next layer is on a different device, requiring data transfer first. This is why partial GPU offloading significantly slows down inference speed.

mmap Memory Mapping

llama.cpp uses mmap by default to load model files. Benefits include:

  • No need to load entire model into memory; OS loads on demand
  • Multiple processes can share the same memory
  • Lower memory footprint

If you want to disable mmap (sometimes problematic), set in Modelfile:

PARAMETER use_mmap false

2. Multi-GPU Configuration: Complete Load Balancing Architecture

If you have two or more GPUs, the biggest headache is: how do you make Ollama use them all?

First, a disappointing fact: Ollama doesn’t support model parallelism. Meaning, you can’t split one model in half, with half running on GPU 0 and the other half on GPU 1. Each model can only bind to one GPU.

So what’s the use of multiple GPUs? Two use cases:

  1. Run different model instances: GPU 0 runs llama3, GPU 1 runs mistral
  2. Run multiple instances of the same model: For load balancing, increasing throughput

2.1 Single Instance Multi-GPU (Limitations and Configuration)

If you just want Ollama to recognize multiple GPUs, the simplest way is using the CUDA_VISIBLE_DEVICES environment variable:

# Only let Ollama use GPU 0 and GPU 1
CUDA_VISIBLE_DEVICES=0,1 ollama serve

However, this configuration has a problem: Ollama defaults to placing the model on GPU 0, leaving GPU 1 idle. You can use the main_gpu parameter to specify the primary GPU:

# Modelfile
FROM llama3
# Set primary GPU to GPU 1
PARAMETER main_gpu 1

But honestly, this approach is limited—you’re just switching which card runs the model, not truly utilizing both cards’ capabilities.

2.2 Multi-Instance Deployment with Load Balancing

The real way to leverage multi-GPU power is running multiple Ollama instances, binding one instance per GPU, then using a load balancer to distribute requests.

The architecture looks like this:

┌─────────┐
│ Client  │  Sends inference request
└────┬────┘
     │
┌────▼────────────────────┐
│ Nginx (Load Balancer)   │  least_conn strategy
│ Port: 8080              │
└────┬─────────┬──────────┘
     │         │
┌────▼────┐ ┌──▼──────┐
│Ollama 1 │ │Ollama 2 │
│GPU 0    │ │GPU 1    │  Each instance has exclusive GPU access
│Port     │ │Port     │
│11434    │ │11435    │
└─────────┘ └─────────┘

Step 1: Start Multiple Ollama Instances

# Instance 1 - Bind to GPU 0, port 11434
CUDA_VISIBLE_DEVICES=0 OLLAMA_HOST=127.0.0.1:11434 ollama serve &

# Instance 2 - Bind to GPU 1, port 11435
CUDA_VISIBLE_DEVICES=1 OLLAMA_HOST=127.0.0.1:11435 ollama serve &

Note: Ollama’s default data directory is ~/.ollama, and both instances will share the same model storage. This is fine because mmap memory mapping allows multiple processes to share the same model file.

Step 2: Configure Nginx Load Balancing

# /etc/nginx/conf.d/ollama.conf
upstream ollama_cluster {
    least_conn;  # Least connections priority strategy
    server 127.0.0.1:11434;
    server 127.0.0.1:11435;
}

server {
    listen 8080;

    location / {
        proxy_pass http://ollama_cluster;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;

        # Streaming response support
        proxy_buffering off;
        proxy_cache off;
    }
}

The least_conn strategy means: each new request goes to the instance with the fewest current connections. This way, both GPUs get more balanced load.

Step 3: Client Calls

Clients only need to connect to Nginx’s port:

# Call through Nginx (automatically distributed to an instance)
curl http://localhost:8080/api/generate -d '{
  "model": "llama3",
  "prompt": "Hello"
}'

Or modify the Ollama client’s default address:

export OLLAMA_HOST=http://localhost:8080
ollama run llama3

2.3 Load Balancing Strategy Comparison

Nginx supports several load balancing strategies, each with different use cases:

| Strategy | Principle | Use Case |
| --- | --- | --- |
| Round Robin (default) | Distribute to instances in sequence | Simple scenarios, uniform model sizes |
| Least Connections (least_conn) | Send to currently least busy instance | Recommended for inference services |
| IP Hash | Same IP always goes to same instance | Scenarios requiring session persistence |

Inference services have unpredictable request durations—some return in seconds, others run for minutes. With round robin, one instance might be overwhelmed while another sits idle. least_conn avoids this problem.
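A toy simulation makes the difference concrete — a least-connections dispatcher always picks the backend with the fewest in-flight requests (this is a simplified model of the idea, not Nginx's actual implementation):

```python
def pick_least_conn(active):
    """Return the backend with the fewest in-flight requests."""
    return min(active, key=active.get)

# Two instances: one stuck on a long-running request, one idle.
# Round robin would still alternate; least_conn picks the idle one.
active = {"127.0.0.1:11434": 3, "127.0.0.1:11435": 0}
backend = pick_least_conn(active)
active[backend] += 1  # the new request goes to the idle instance
print(backend)  # → 127.0.0.1:11435
```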

If you want more even distribution with automatic failover when an instance crashes, add health checks:

upstream ollama_cluster {
    least_conn;
    server 127.0.0.1:11434 max_fails=3 fail_timeout=30s;
    server 127.0.0.1:11435 max_fails=3 fail_timeout=30s;
}

This way, if an instance fails 3 consecutive times, Nginx temporarily removes it from the cluster, retrying after 30 seconds.

3. VRAM Optimization Strategies: Quantization, Context, and Batching in Practice

When VRAM is insufficient, the parameter adjustment priority is: quantization > context length > batch size > GPU layers.

Why this order? Because quantization has the biggest impact—the same model with Q4 quantization uses 75% less VRAM than FP16. Adjusting GPU layers only moves computation from GPU to CPU, saving VRAM but sacrificing speed.

3.1 Quantization Level Selection

Quantization uses fewer bits to store model parameters. FP16 uses 16 bits per parameter, Q4 uses only 4 bits. Fewer bits means precision loss, but real-world testing shows Q4 quantization has only 2-3% quality loss, which is acceptable for most scenarios.

Quantization level comparison:

| Quantization | VRAM Usage (relative to FP16) | Quality Loss | Use Case |
| --- | --- | --- | --- |
| Q4_K_M | ~25% | 2-3% | Recommended: balance of performance and quality |
| Q5_K_M | ~33% | 1-2% | Scenarios requiring slightly higher precision |
| Q8_0 | ~50% | 0.5% | Near-original precision |
| FP16 | 100% | None | Research, benchmarking |

Real Data Reference: Llama 2 13B model

  • FP16: ~26GB VRAM
  • Q4_K_M: ~8GB VRAM
  • Q8_0: ~13GB VRAM

So a 13B Q4 model just barely fits in 8GB of VRAM. But KV cache also needs space, making it prone to overflow during inference.

When choosing quantization level: for daily use, Q4_K_M is sufficient. For tasks requiring high precision like translation or code generation, consider Q5_K_M or Q8_0.
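As a sanity check, those VRAM numbers follow from simple arithmetic — weights VRAM is roughly parameters × bits per weight (treating Q4_K_M as ~4.5 effective bits is an approximation; exact footprints vary by quantization scheme):

```python
def model_vram_gb(n_params_b, bits_per_weight):
    """Rough model-weights VRAM estimate: parameters × bits, converted to GB."""
    return n_params_b * 1e9 * bits_per_weight / 8 / 1e9

# Llama 2 13B: Q4_K_M (~4.5 effective bits) vs FP16 (16 bits)
print(round(model_vram_gb(13, 4.5), 1))  # → 7.3 (plus overhead ≈ the ~8GB above)
print(round(model_vram_gb(13, 16), 1))   # → 26.0
```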

Ollama downloads Q4 quantization by default. To use other quantization versions, add suffix to model name:

# Q4 quantization (default)
ollama pull llama3

# Q8 quantization
ollama pull llama3:8b-q8_0

3.2 Context Length Optimization

KV cache is used during inference to store previous conversation history. Its VRAM usage directly correlates with context length.

Estimation Formula (simplified, FP16 cache):

KV Cache VRAM ≈ 2 × num_ctx × num_layers × hidden_dim × 2 bytes

The leading 2 is there because both the K and V tensors are stored per layer.

Take Llama 2 7B as an example:

  • num_layers = 32
  • hidden_dim = 4096
  • num_ctx = 4096

Calculated KV cache is about 2GB. If you expand ctx to 8192, KV cache becomes 4GB. Double the context, double the KV cache VRAM.
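The same arithmetic as a small Python helper (FP16 cache, 2 bytes per value, and a factor of 2 because both K and V are stored):

```python
def kv_cache_bytes(num_ctx, num_layers, hidden_dim, bytes_per_val=2):
    """Estimate KV cache size: K and V tensors, per layer, per context slot."""
    return 2 * num_ctx * num_layers * hidden_dim * bytes_per_val

# Llama 2 7B at the default 4096 context: ~2 GiB
gib = kv_cache_bytes(4096, 32, 4096) / 2**30
print(f"{gib:.1f} GiB")  # → 2.0 GiB
```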

Optimization Strategies:

  1. Short conversation scenarios: Use num_ctx: 2048

    • Saves half the KV cache VRAM
    • Sufficient for daily Q&A and simple tasks
  2. Long document processing: Don’t directly set ctx to 16000 or higher; use chunking strategy

    • Split documents into chunks, process sequentially
    • More stable and controllable than loading everything at once

Set context length in Modelfile:

FROM llama3
# Reduce context length
PARAMETER num_ctx 2048

A common misconception: many think reducing ctx affects output quality. It doesn’t—ctx only affects how much previous conversation the model can “remember.” If your conversation only has a few turns, ctx at 2048 or 4096 makes no difference.

3.3 Batching and Concurrency Optimization

The num_batch parameter controls how many tokens to process at once. Default is 512, meaning Ollama processes 512 tokens’ worth of inference at a time.

What’s the benefit of larger batches? Higher parallel computing efficiency. The tradeoff is higher peak VRAM usage.

When VRAM is tight, reducing batch size alleviates peak pressure:

FROM llama3
# Reduce from 512 to 256
PARAMETER num_batch 256

In practice, reducing batch from 512 to 256 lowers peak VRAM by about 20%. Inference speed drops a bit, but not as dramatically as reducing GPU layers.

Concurrency Issues

Ollama processes requests serially by default—one request completes before the next starts. If you send multiple requests simultaneously, they queue up.

Two solutions to improve concurrency:

  1. Multi-instance deployment: The multi-GPU load balancing solution mentioned earlier, where each instance processes requests independently
  2. Queue system: Add a queue at the application layer (like Redis Queue) to manage request distribution

The second solution is better for scenarios without multiple GPUs. Handle it in application code:

import json

import redis
import ollama  # official Python client

r = redis.Redis()

# Producer: enqueue a request as JSON
r.lpush('ollama_queue', json.dumps({"model": "llama3", "prompt": "Hello"}))

# Background worker: pop and process requests one at a time
raw = r.rpop('ollama_queue')
if raw:
    req = json.loads(raw)
    result = ollama.generate(model=req["model"], prompt=req["prompt"])

4. Real-World Scenarios: 3 Case Studies

Enough theory—let’s look at actual problems and solutions.

4.1 Scenario 1: Running 13B Model Stably on 8GB VRAM

Problem

User has RTX 3060 (8GB VRAM), wants to run Llama 2 13B Q4 model. Model itself needs about 8GB, just barely fits. But after a few inferences, OOM errors start appearing—KV cache overflows the VRAM.

Solution

Core approach: reduce KV cache usage + lower peak VRAM.

FROM llama2:13b-q4

# 13B model has 40 layers; only put 30 on GPU
PARAMETER num_gpu 30
# Low VRAM mode: KV cache goes to CPU memory
PARAMETER low_vram true
# Halve context length, halve the KV cache
PARAMETER num_ctx 2048
# Smaller batches lower peak VRAM
PARAMETER num_batch 256

Combined, these parameters keep VRAM usage stable around 6GB, leaving 2GB headroom for fluctuations.

Results

  • VRAM usage: from ~8GB down to ~6GB (stable operation)
  • Inference speed: ~8 tokens/s (slower than full GPU, but much faster than CPU)
  • Stability: no more OOM crashes

The tradeoff is slower inference speed. Because 10 layers must be computed by CPU, each GPU-CPU transfer incurs data overhead. But at least it works without crashing unexpectedly.

4.2 Scenario 2: Dual GPU Load Balancing to Increase Throughput

Problem

User has two RTX 3090s (24GB VRAM each), deployed Ollama as an external API service. Problem is single instance only processes requests serially, poor concurrency, requests queue up during peak hours.

Checking nvidia-smi, the two cards have vastly different utilization—one consistently 70%+, the other only 20% or so.

Solution

Multi-instance + Nginx load balancing, detailed in chapter 2. Here’s the complete startup script:

#!/bin/bash
# start_ollama_cluster.sh

# Instance 1 - GPU 0
CUDA_VISIBLE_DEVICES=0 \
OLLAMA_HOST=127.0.0.1:11434 \
OLLAMA_MODELS=/home/user/.ollama \
nohup ollama serve > ollama1.log 2>&1 &

# Instance 2 - GPU 1
CUDA_VISIBLE_DEVICES=1 \
OLLAMA_HOST=127.0.0.1:11435 \
OLLAMA_MODELS=/home/user/.ollama \
nohup ollama serve > ollama2.log 2>&1 &

# Preload models to both instances
sleep 5
curl http://127.0.0.1:11434/api/pull -d '{"name": "llama3"}'
curl http://127.0.0.1:11435/api/pull -d '{"name": "llama3"}'

echo "Ollama cluster started on ports 11434 and 11435"

Nginx configuration uses least_conn strategy to ensure even request distribution.

Results

  • Overall throughput: ~80% increase (from single-instance serial to dual-instance parallel)
  • Single GPU utilization: from 40% average → 80% average (both cards working)
  • Response latency: ~50% reduction during peak hours (no more queuing)

Real data: single instance processing 100 requests takes about 10 minutes, dual-instance load balancing takes just over 5 minutes.

4.3 Scenario 3: Automating Dynamic VRAM Allocation

Problem

User has multiple models of different sizes, needs to manually adjust GPU layer configuration when switching. Sometimes forgets to change, crashes. Can this be automated?

Solution

Write a script to automatically choose appropriate Modelfile configuration based on current VRAM.

#!/bin/bash
# auto_offload.sh - Automatic GPU offloading configuration

# Get current GPU free VRAM (in MB)
GPU_MEM_FREE=$(nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits | head -1)

# Model size reference (MB) and total layer counts — both are needed,
# because num_gpu is a layer count, not a percentage
declare -A MODEL_SIZES MODEL_LAYERS
MODEL_SIZES["llama3:8b-q4"]=5000
MODEL_LAYERS["llama3:8b-q4"]=32
MODEL_SIZES["llama3:70b-q4"]=40000
MODEL_LAYERS["llama3:70b-q4"]=80
MODEL_SIZES["mistral:7b-q4"]=4500
MODEL_LAYERS["mistral:7b-q4"]=32

MODEL_NAME=$1

if [ -z "$MODEL_NAME" ]; then
    echo "Usage: $0 <model_name>"
    exit 1
fi

MODEL_SIZE=${MODEL_SIZES[$MODEL_NAME]}
TOTAL_LAYERS=${MODEL_LAYERS[$MODEL_NAME]}

if [ -z "$MODEL_SIZE" ]; then
    echo "Unknown model size for $MODEL_NAME"
    exit 1
fi

# Determine if free VRAM can hold the whole model
if [ "$GPU_MEM_FREE" -gt "$MODEL_SIZE" ]; then
    # Full GPU offloading: put every layer on the GPU
    echo "Using full GPU offloading (enough memory)"
    cat > /tmp/modelfile_temp <<EOF
FROM $MODEL_NAME
PARAMETER num_gpu $TOTAL_LAYERS
PARAMETER low_vram false
EOF
else
    # Partial offloading: scale total layers by the fraction that fits
    GPU_LAYERS=$((GPU_MEM_FREE * TOTAL_LAYERS / MODEL_SIZE))
    echo "Using partial GPU offloading ($GPU_LAYERS/$TOTAL_LAYERS layers)"
    cat > /tmp/modelfile_temp <<EOF
FROM $MODEL_NAME
PARAMETER num_gpu $GPU_LAYERS
PARAMETER low_vram true
PARAMETER num_ctx 2048
EOF
fi

# Create model
ollama create "${MODEL_NAME}-auto" -f /tmp/modelfile_temp
echo "Created ${MODEL_NAME}-auto with auto config"

Usage:

# Run script to automatically create model with appropriate config
./auto_offload.sh llama3:70b-q4

Results

  • Automatically adapts to VRAM changes
  • Reduces manual configuration errors
  • No need to change parameters when switching models

This script can be extended: add monitoring to automatically switch to low VRAM mode when memory runs low, or use scheduled tasks to preload models during off-hours.

5. Configuration Reference and Troubleshooting

5.1 Hardware Configuration Quick Reference

A quick reference table to help you find the right configuration for your hardware:

| VRAM Size | Recommended Model | Quantization | GPU Layers | Other Parameters |
| --- | --- | --- | --- | --- |
| 6GB | 7B model | Q4 | Partial (~50%) | low_vram=true, ctx=2048 |
| 8GB | 7B model | Q4 | Full GPU | ctx=2048 (safe) |
| 8GB | 13B model | Q4 | Partial (~75%) | low_vram=true, ctx=2048, batch=256 |
| 12GB | 13B model | Q4 | Full GPU | ctx=4096 usable |
| 16GB | 13B model | Q8 or Q5 | Full GPU | ctx=4096 |
| 16GB | 70B model | Q4 | Partial (~50%) | low_vram=true |
| 24GB | 70B model | Q4 | Partial (~60%) | low_vram=true (70B Q4 weights alone are ~40GB) |
| 48GB (dual) | 70B model | Q4 | Partial per card | Multi-instance load balancing |

Note: These are conservative estimates. You also need to consider KV cache and system reserved space. If your scenario involves long conversations (large context), be more conservative.

5.2 VRAM Monitoring Tools

nvidia-smi Real-time Monitoring

Simplest approach:

# Refresh every second
nvidia-smi -l 1

# View only VRAM usage
nvidia-smi --query-gpu=memory.used,memory.free --format=csv -l 1

Output shows VRAM usage per card. Watch it during inference to see how VRAM grows.

Ollama Verbose Logging

ollama run llama3 --verbose

Output displays detailed information during model loading, including:

  • GPU offloading layer count
  • Model memory usage
  • Whether mmap is enabled
  • KV cache allocation

Seeing GPU offloading: 40/40 layers tells you the model is fully on GPU.

Monitoring Script Example

For long-term VRAM usage monitoring, write a script to log data:

#!/bin/bash
# monitor_gpu.sh

LOG_FILE="gpu_memory.log"

while true; do
    TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
    GPU_MEM=$(nvidia-smi --query-gpu=memory.used,memory.free --format=csv,noheader)
    echo "$TIMESTAMP $GPU_MEM" >> $LOG_FILE
    sleep 5
done

Run it in background, check historical data anytime.

5.3 Common Issue Troubleshooting

Issue 1: OOM During Inference

Troubleshooting steps:

  1. First check nvidia-smi to confirm VRAM is indeed insufficient
  2. Check current configuration:
    • Is quantization Q4? (If not, change to Q4)
    • Is context length too large? (Change to 2048)
    • Is batch size too large? (Change to 256)
    • Is GPU layer count full GPU? (Reduce a few layers)
  3. If all above are adjusted and still not working, enable low_vram=true

Adjustment priority: quantization > ctx > batch > GPU layers > low_vram

Issue 2: Slow Inference Speed

First confirm if GPU offloading layer count is insufficient:

ollama run your_model --verbose | grep "GPU offloading"

If you see GPU offloading: 20/40 layers, that means half the layers are computed on CPU, slow speed is normal.

Solutions: switch to a more aggressive quantization (e.g., Q8 → Q4, so more layers fit in VRAM) or get a GPU with more VRAM. If neither is possible, accept the speed.

Issue 3: VRAM Fluctuations, Instability

VRAM fluctuations mainly come from KV cache. Longer conversations mean larger KV cache.

Solution: limit context length, or control conversation history length at application layer (like keeping only the last 10 turns).
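A minimal application-layer sketch of that idea — keep a system prompt plus only the most recent turns (the 10-turn cutoff and message format are illustrative):

```python
def trim_history(messages, max_turns=10):
    """Keep the system prompt (if any) plus the last max_turns turns."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_turns * 2:]  # each turn = user + assistant message

# 30 turns of chatter collapses to the last 10 turns (20 messages)
history = [{"role": "system", "content": "be brief"}]
for i in range(30):
    history += [{"role": "user", "content": f"q{i}"},
                {"role": "assistant", "content": f"a{i}"}]
print(len(trim_history(history)))  # → 21 (system + 20 recent messages)
```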

Issue 4: Multi-GPU Configured But Still Only Using One Card

Check whether Nginx is actually distributing requests:

curl http://localhost:8080/api/tags

Run this a few times and tail both instances' logs. If only one instance ever shows activity, requests aren't being distributed.

If the two cards have very different utilization, possible causes:

  • least_conn strategy not configured
  • One instance has problems (check logs)
  • Model only loaded on one instance

Summary

After all this discussion, the core points are:

  1. When VRAM is tight, prioritize quantization: Q4 saves 75% VRAM compared to FP16 with minimal quality loss
  2. Watch KV cache usage: Context length directly affects KV cache; long conversations mean more VRAM pressure
  3. Use load balancing for multi-GPU: Single-instance multi-GPU mode is limited; multi-instance + Nginx is the real solution
  4. Understand llama.cpp internals: GPU offloading isn’t magic; it’s layered computation with data transfer overhead

Here are some ready-to-use configurations:

Stable 8GB VRAM Configuration:

PARAMETER num_gpu 30
PARAMETER low_vram true
PARAMETER num_ctx 2048
PARAMETER num_batch 256

Dual GPU Load Balancing Startup:

CUDA_VISIBLE_DEVICES=0 OLLAMA_HOST=127.0.0.1:11434 ollama serve &
CUDA_VISIBLE_DEVICES=1 OLLAMA_HOST=127.0.0.1:11435 ollama serve &

Finally, if this article helped, check out other articles in the series. Part 6 covers quantization and batching basics; this article is the deep dive into GPU aspects. Part 8 will cover multi-model parallel deployment, applying multi-GPU configuration to more complex scenarios.

For questions, search Ollama GitHub Discussions—many practical issues are discussed in the community. Or leave a comment, and I’ll respond when I see it.


FAQ

Can Ollama split one model across multiple GPUs for parallel computation?
No. Ollama doesn't support model parallelism (Tensor Parallelism). Each model instance can only bind to one GPU. To leverage multiple GPUs, run multiple instances + Nginx load balancing.
Why does OOM occur after model loads successfully, but only after a few inferences?
Because of KV cache. Model loading only calculates VRAM used by the model itself, but KV cache grows with conversation length during inference. Suggestions:

• Reduce context length (num_ctx)
• Enable low_vram mode
• Shorten conversation history
Which parameter should I adjust first when VRAM is insufficient?
Priority: quantization > context length > batch size > GPU layers > low_vram. Quantization has the biggest impact—Q4 saves 75% VRAM with only 2-3% quality loss.
Does num_gpu parameter mean how many GPUs I have?
No. num_gpu refers to how many model layers to compute on GPU. For a 32-layer model, num_gpu=32 means full GPU; num_gpu=20 means 20 layers on GPU, 12 layers on CPU.
What strategy should I use for multi-GPU load balancing?
Recommend least_conn (least connections priority). Inference request durations are unpredictable; round robin might cause one instance to be overwhelmed while another sits idle. least_conn ensures requests go to the currently least busy instance.
What size model can 8GB VRAM run?
Conservative configuration:

• 7B Q4: Full GPU, ctx=2048
• 13B Q4: Partial GPU (~75%), requires low_vram + ctx=2048 + batch=256
• Larger models need more VRAM or CPU offloading

15 min read · Published on: Apr 11, 2026 · Modified on: Apr 11, 2026
