
Ollama Production Monitoring: Logging Configuration and Prometheus Alerting in Practice

3:17 AM. My phone vibrates once on the nightstand, then a second time, a third. I groggily swipe the screen, and the red Slack alert burns my eyes: Ollama API timeout - service unavailable.

My first thought: this is bad.

We had just launched a Llama 3.1-based customer service system two weeks earlier. The user base was small, maybe a few hundred calls per day. When we deployed it, I’ll admit I was a bit nervous—we only had basic logging configured, no monitoring or alerting whatsoever. The plan was “let’s get it running first.” The result? When I was woken up at 3 AM, I had no idea what was wrong. GPU memory full? Process crashed? Network issue? I was completely in the dark.

That incident took until 6 AM to resolve. In the post-mortem, I found that many teams are in the same boat.

70% of AI projects fail to reach production (Source: Hyperion Consulting 2026 Report)

Lack of monitoring is one of the main culprits.

This article is my attempt to help you avoid the pitfalls I encountered. I’ll share a complete solution—from logging configuration to Prometheus + Grafana monitoring to AlertManager setup—with configuration files you can copy and use directly. Following this guide, you can set up a production-grade monitoring system in about 30 minutes. Honestly, if I’d had this setup back then, I could have slept at least three more hours that night.

Core Challenges of Production Monitoring

Ollama is different from typical web services. It’s a “resource hog.” Each loaded model consumes 4 to 16 GB of memory alone (data from Markaicode’s benchmarks). And cold starts—loading models from disk to memory—take 10 to 30 seconds. This means if your service crashes and restarts, users have to wait half a minute before getting a response.

The pitfalls I’ve encountered include:

Memory leaks and GPU exhaustion. After running for extended periods, Ollama sometimes “forgets” to release GPU memory. I’ve seen a 24GB VRAM machine that, after two days of running, had only 2GB available—all new requests were rejected. The problem was, I had no idea what was happening until users started complaining.

Request queue buildup. Inference is inherently slow; a single request can take 5-20 seconds. If dozens of requests arrive simultaneously, the queue grows longer and longer until timeouts occur. But how do you know if the queue is backing up? You can only guess.

Model loading latency. When switching between multiple models, loading time is a black box. Users don’t know why responses are slow, and neither do you.

So the monitoring objectives are clear: service availability (is the process still running?), performance metrics (how fast are responses?), resource utilization (how much GPU memory is left?), and error rate (how many requests failed?). Once these four dimensions are covered, you can have peace of mind.
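Before any Prometheus setup, the first dimension can be checked with a few lines of code. Here's a minimal availability probe you could run from cron or a sidecar — a sketch that assumes the default Ollama endpoint and uses the public `/api/tags` route; the 2-second "slow" threshold is an illustrative assumption, not a recommendation:

```python
# Minimal Ollama availability probe: is the process up, how fast does
# it answer, and which models are loaded? Stdlib only.
import json
import time
import urllib.error
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # adjust to your deployment


def probe(url: str, timeout: float = 5.0) -> dict:
    """Hit /api/tags and measure wall-clock latency."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(f"{url}/api/tags", timeout=timeout) as resp:
            models = json.load(resp).get("models", [])
        return {"up": True, "latency_s": time.monotonic() - start,
                "loaded_models": [m["name"] for m in models]}
    except (urllib.error.URLError, OSError):
        return {"up": False, "latency_s": time.monotonic() - start,
                "loaded_models": []}


def classify(result: dict, slow_threshold_s: float = 2.0) -> str:
    """Map a probe result onto a coarse status for alert routing."""
    if not result["up"]:
        return "critical"
    if result["latency_s"] > slow_threshold_s:
        return "warning"
    return "ok"


# Usage (requires a running Ollama instance):
#   print(classify(probe(OLLAMA_URL)))
```

This is no substitute for real monitoring, but it covers the "is the process still running?" question until the rest of this guide is in place.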

For monitoring solution selection, I’ve tried several combinations. For small teams, Prometheus + Grafana is sufficient; if you need to track LLM Prompts and responses, Langfuse is excellent; for enterprise environments, consider SigNoz, which is based on OpenTelemetry and unifies logs, metrics, and traces. I’ll focus on the Prometheus solution since it’s the most universal foundation.

Logging Configuration and systemd Service Optimization

Getting Ollama running is easy, but keeping it stable requires getting logging right first. I learned this the hard way—when something went wrong and I went to check the logs, I found nothing was recorded, or the log files had ballooned to dozens of GB and filled the disk.

systemd Service Configuration

If you installed Ollama using the official script, it already created a systemd service for you. But the default configuration is basic. For production environments, you need to add a few things:

# /etc/systemd/system/ollama.service

[Unit]
Description=Ollama Service
After=network.target

[Service]
Type=simple
User=ollama
Group=ollama

# Working directory
WorkingDirectory=/usr/share/ollama

# Environment variables
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_MODELS=/data/ollama/models"
Environment="OLLAMA_DEBUG=1"
Environment="OLLAMA_LOG_FORMAT=json"

# Resource limits (adjust based on your hardware)
LimitNOFILE=65535
LimitNPROC=4096
MemoryMax=32G

# Auto-restart strategy
Restart=always
RestartSec=10

# Startup command
ExecStart=/usr/local/bin/ollama serve

# Standard output and error output
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target

Let me share some key lessons learned:

Restart=always and RestartSec=10: Automatically restart the process after abnormal exit. The 10-second wait gives the system some breathing room. I once encountered repeated crashes due to memory exhaustion—without this interval, it would have restarted frantically and flooded the logs.

MemoryMax=32G: Limit the maximum memory Ollama can use. This is critical if your machine runs other services. I once didn’t set a limit, and Ollama consumed all 64GB of memory—I couldn’t even SSH in.

OLLAMA_DEBUG=1 and OLLAMA_LOG_FORMAT=json: I recommend enabling debug mode in production—it’s invaluable when troubleshooting issues. JSON format makes it easier to parse logs with tools later.

After modifying the configuration, don’t forget to reload:

sudo systemctl daemon-reload
sudo systemctl restart ollama
sudo systemctl enable ollama  # Enable on boot

Docker Deployment Logging Configuration

If running with Docker, log management is even easier to mess up. Docker writes logs to /var/lib/docker/containers/ by default, and without limits, they grow indefinitely.

My docker-compose configuration looks like this:

# docker-compose.yml
version: '3.8'

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: always
    ports:
      - "11434:11434"
    volumes:
      - ./ollama_data:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0:11434
      - OLLAMA_DEBUG=1
    deploy:
      resources:
        limits:
          memory: 32G
    logging:
      driver: "json-file"
      options:
        max-size: "100m"
        max-file: "5"

max-size: "100m" means each log file is capped at 100MB, and max-file: "5" keeps 5 files maximum. That’s at most 500MB of logs—enough for troubleshooting without filling the disk.

Log Level Reference

Ollama supports these environment variables:

| Variable | Description | Production Recommendation |
| --- | --- | --- |
| OLLAMA_DEBUG | Set to 1 to enable detailed logging | Enabled |
| OLLAMA_LOG_LEVEL | Log level (INFO/DEBUG/WARN) | INFO or DEBUG |
| OLLAMA_LOG_FORMAT | Log format (text/json) | json |

I generally keep DEBUG enabled—disk space isn’t an issue, and it saves a lot of time when troubleshooting.

Practical journalctl Logging

Once configured, use journalctl to view logs:

# View logs in real-time
sudo journalctl -u ollama -f

# View last 100 lines
sudo journalctl -u ollama -n 100

# View today's logs
sudo journalctl -u ollama --since today

# Search for specific keywords
sudo journalctl -u ollama | grep -i "error"

# Export logs to file
sudo journalctl -u ollama --since "2026-04-12 00:00:00" > ollama-debug.log

Here’s a tip: if you enabled JSON-format logging, you can parse it with jq. Note that journalctl’s own `-o json` wraps your application’s line inside the `MESSAGE` field, so print the raw message with `-o cat` instead and let jq skip any non-JSON lines:

sudo journalctl -u ollama -o cat | jq -R 'fromjson? | select(.level=="error")'

This filters to error-level logs only—no more digging through piles of INFO entries.

Prometheus + Grafana Monitoring Solution

Logging is for post-mortem analysis; monitoring is the early warning system. I’ve been using Prometheus + Grafana for over two years. The setup can be tedious, but it’s stable, reliable, and has abundant community resources.

ollama-exporter Deployment

Ollama doesn’t expose Prometheus metrics directly—you need an exporter to collect them. I use frcooper/ollama-exporter. While it has only 36 stars, it does the job.

There are two deployment options: run the binary directly, or use Docker. I recommend Docker:

# Add exporter service to docker-compose.yml
services:
  ollama-exporter:
    image: frcooper/ollama-exporter:latest
    container_name: ollama-exporter
    restart: always
    ports:
      - "9101:9101"
    environment:
      - OLLAMA_HOST=ollama:11434  # Point to ollama container
    depends_on:
      - ollama

Then the Prometheus configuration:

# prometheus.yml
global:
  scrape_interval: 30s  # Scrape interval, Markaicode recommends 30 seconds
  evaluation_interval: 30s

scrape_configs:
  - job_name: 'ollama-exporter'
    static_configs:
      - targets: ['ollama-exporter:9101']
        labels:
          instance: 'ollama-prod'

  # GPU monitoring (if using NVIDIA)
  - job_name: 'nvidia-gpu'
    static_configs:
      - targets: ['localhost:9835']

Add Prometheus to docker-compose as well:

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    restart: always
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'

volumes:
  prometheus_data:

Key Monitoring Metrics

The ollama-exporter collects these metrics—here are the important ones:

| Metric Name | Description | Watch For |
| --- | --- | --- |
| ollama_requests_total | Total requests | Error-rate calculation |
| ollama_requests_failed | Failed requests | Direct monitoring |
| ollama_model_load_duration_seconds | Model load time | Cold-start performance |
| ollama_request_duration_seconds | Request response time | P95/P99 latency |
| ollama_tokens_per_second | Inference speed | Throughput |

There are also system-level metrics (requiring node-exporter):

  • CPU utilization: node_cpu_seconds_total
  • Memory utilization: node_memory_MemAvailable_bytes
  • Network traffic: node_network_receive_bytes_total
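If you want to consume these metrics outside Grafana — say, in a daily report script — Prometheus exposes an HTTP query API. A small sketch, assuming Prometheus on its default port; the `/api/v1/query` endpoint and its response shape are part of Prometheus's stable HTTP API, and the metric names follow the table above:

```python
# Query the Ollama error rate via the Prometheus HTTP API. Stdlib only.
import json
import urllib.parse
import urllib.request

PROM_URL = "http://localhost:9090"  # adjust to your Prometheus

ERROR_RATE_QUERY = (
    "rate(ollama_requests_failed[5m]) / rate(ollama_requests_total[5m])"
)


def instant_query(prom_url: str, promql: str) -> dict:
    """Run an instant query against /api/v1/query and return the raw JSON."""
    url = f"{prom_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def extract_values(response: dict) -> list[float]:
    """Pull the sample value out of each series in a query response."""
    if response.get("status") != "success":
        return []
    # Each result entry carries value = [timestamp, "stringified number"]
    return [float(series["value"][1])
            for series in response["data"]["result"]]


# Usage (requires a running Prometheus with the exporter scraped):
#   print(extract_values(instant_query(PROM_URL, ERROR_RATE_QUERY)))
```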

GPU Monitoring Configuration

The GPU is the heart of an LLM service—monitoring must be thorough. I use nvidia_gpu_prometheus_exporter:

# Install NVIDIA GPU exporter
docker run -d \
  --name nvidia-exporter \
  --restart always \
  -p 9835:9835 \
  --gpus all \
  nvidia/gpu-prometheus-exporter:latest

It outputs these key metrics:

  • nvidia_gpu_utilization: GPU utilization
  • nvidia_gpu_memory_used_bytes: Memory usage
  • nvidia_gpu_memory_free_bytes: Available memory
  • nvidia_gpu_temperature: GPU temperature

In multi-GPU environments, metrics include a gpu_id label, allowing you to display each card separately in Grafana.

Grafana Dashboard Configuration

I’ll give you a ready-to-import Grafana Dashboard JSON. Save this as a file, then in Grafana click Import Dashboard:

{
  "dashboard": {
    "title": "Ollama Production Monitor",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(ollama_requests_total[5m])",
            "legendFormat": "Requests/sec"
          }
        ],
        "gridPos": {"x": 0, "y": 0, "w": 12, "h": 6}
      },
      {
        "title": "Error Rate",
        "type": "gauge",
        "targets": [
          {
            "expr": "rate(ollama_requests_failed[5m]) / rate(ollama_requests_total[5m]) * 100",
            "legendFormat": "Error %"
          }
        ],
        "gridPos": {"x": 12, "y": 0, "w": 6, "h": 6}
      },
      {
        "title": "GPU Memory Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "nvidia_gpu_memory_used_bytes / nvidia_gpu_memory_total_bytes * 100",
            "legendFormat": "GPU {{gpu_id}}"
          }
        ],
        "gridPos": {"x": 0, "y": 6, "w": 12, "h": 6}
      },
      {
        "title": "Response Latency P95",
        "type": "stat",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(ollama_request_duration_seconds_bucket[5m]))",
            "legendFormat": "P95 Latency"
          }
        ],
        "gridPos": {"x": 12, "y": 6, "w": 6, "h": 6}
      }
    ]
  },
  "overwrite": true
}

The actual effect looks something like this:

  • Top left: Request rate curve, shows peak periods
  • Top right: Error rate gauge, turns red above 5%
  • Bottom left: Multi-GPU memory usage curves
  • Bottom right: P95 latency value

I also add a Tokens/s panel to compare inference speeds across different models horizontally.

Grafana Data Source Configuration

After the Grafana container starts, you need to manually configure the Prometheus data source:

  1. Login to Grafana (default admin/admin)
  2. Configuration -> Data Sources -> Add data source
  3. Select Prometheus, URL set to http://prometheus:9090
  4. Save & Test

If using docker-compose for deployment, the containers can communicate directly using container names.
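If you'd rather not repeat those clicks on every rebuild, Grafana also supports file-based data source provisioning. A minimal sketch following Grafana's provisioning format — mount the file into the container under /etc/grafana/provisioning/datasources/ (the file name below is an arbitrary choice):

```yaml
# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090   # container name, as above
    isDefault: true
```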

Alert Rules and AlertManager Configuration

Monitoring shows you problems, but alerting tells you “handle this now.” I once made a mistake: I set all alerts to critical, my phone buzzed dozens of times a day, and eventually I became numb—when a real issue came up, I didn’t react properly.

Alert Tiering Strategy

I divide alerts into three tiers. This logic came from iterating through several incidents:

| Level | Trigger Condition | Response Requirement |
| --- | --- | --- |
| Critical | Service down, GPU memory >95%, error rate >20% | Immediate action (Slack + phone push) |
| Warning | Response time >60s, GPU memory >80%, error rate >5% | Review within 1 hour (Slack only) |
| Info | Model switch, new version deployment | Log only (email digest) |

Key principle: Critical alerts must be rare—when you see one, it should make you nervous.

Prometheus Alert Rules

Add alert rules to prometheus.yml:

rule_files:
  - 'ollama_alerts.yml'

Then create a separate ollama_alerts.yml:

# ollama_alerts.yml
groups:
  - name: ollama_critical
    rules:
      # Service down alert
      - alert: OllamaServiceDown
        expr: up{job="ollama-exporter"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Ollama Service Down"
          description: "Ollama exporter unreachable, service may have stopped"

      # GPU memory alert (>95%)
      - alert: GPUMemoryCritical
        expr: nvidia_gpu_memory_used_bytes / nvidia_gpu_memory_total_bytes > 0.95
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "GPU Memory Nearly Exhausted"
          description: "GPU {{ $labels.gpu_id }} memory usage exceeds 95%, currently {{ $value | humanizePercentage }}"

      # High error rate alert
      - alert: HighErrorRate
        expr: rate(ollama_requests_failed[5m]) / rate(ollama_requests_total[5m]) > 0.20
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Request Error Rate Too High"
          description: "Error rate exceeded 20% in the last 5 minutes, check logs"

  - name: ollama_warning
    rules:
      # Response time alert
      - alert: SlowResponseTime
        expr: histogram_quantile(0.95, rate(ollama_request_duration_seconds_bucket[5m])) > 60
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P95 Response Time Too Slow"
          description: "95% of requests have response time exceeding 60 seconds"

      # GPU memory warning
      - alert: GPUMemoryWarning
        expr: nvidia_gpu_memory_used_bytes / nvidia_gpu_memory_total_bytes > 0.80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU Memory Usage High"
          description: "GPU {{ $labels.gpu_id }} memory usage exceeds 80%"

      # Error rate warning
      - alert: ErrorRateWarning
        expr: rate(ollama_requests_failed[5m]) / rate(ollama_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Request Error Rate Rising"
          description: "Error rate exceeded 5% in the last 5 minutes"

A few notes:

  • for: Xm: Trigger only after X minutes to avoid false positives from momentary spikes
  • GPU alert threshold at 95%: In practice, once you exceed 95%, things go wrong almost immediately
  • Error rate alerts use rate(): Absolute numbers are meaningless; you need to look at trends
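To make the last point concrete, here's what `rate()` actually computes: the per-second increase of a counter over a window. A toy calculation with made-up sample numbers:

```python
# Illustration of rate(): per-second increase of a monotonically
# increasing counter between two scrapes. Numbers are invented.
def per_second_rate(v1: float, v2: float, t1: float, t2: float) -> float:
    """Per-second increase of a counter between samples (v1, t1) and (v2, t2)."""
    return (v2 - v1) / (t2 - t1)


# Two scrapes 30 seconds apart:
#   ollama_requests_failed went 100 -> 115
#   ollama_requests_total  went 2000 -> 2150
failed_rate = per_second_rate(100, 115, 0, 30)    # 0.5 failures/s
total_rate = per_second_rate(2000, 2150, 0, 30)   # 5.0 requests/s
error_ratio = failed_rate / total_rate            # 0.1, i.e. 10%

print(f"{error_ratio:.0%}")  # → 10%
```

The same 15 failures would mean a 1% error rate on a busier service — which is exactly why the alert divides two rates instead of thresholding an absolute count.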

AlertManager Configuration

AlertManager handles sending alerts. Configuration file alertmanager.yml:

global:
  resolve_timeout: 5m

# Routing configuration
route:
  group_by: ['severity', 'alertname']
  group_wait: 30s      # Wait 30 seconds to collect alerts in same group
  group_interval: 5m   # Interval between same group alerts
  repeat_interval: 3h  # Repeat interval for unresolved alerts
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
      continue: false
    - match:
        severity: warning
      receiver: 'warning-alerts'
      continue: false
    - match:
        severity: info
      receiver: 'info-alerts'

# Receiver configuration
receivers:
  - name: 'critical-alerts'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#ollama-critical'
        send_resolved: true
        title: '{{ .Status | toUpper }}: {{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'

  - name: 'warning-alerts'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#ollama-monitor'
        send_resolved: true

  - name: 'info-alerts'
    email_configs:
      - to: 'your-email@example.com'
        send_resolved: true

Slack Webhook Configuration Steps

  1. Create an App in Slack (or use Incoming Webhooks)
  2. Add the Webhook URL to the api_url field
  3. Recommended to use separate channels: dedicated channel for critical, regular channel for warnings

I also add mobile push notifications. If you use PagerDuty or OpsGenie, AlertManager has built-in integrations. For a free option, Telegram Bot works well and isn’t complicated to configure.

Silences and Inhibition Rules

Sometimes you need to temporarily silence alerts, like during maintenance. You can do this directly in the AlertManager UI:

# Access AlertManager UI
http://your-server:9093

# Click Silences -> New Silence
# Set duration, match labels

You can also use the API (note: AlertManager 0.27+ serves only the v2 API; older guides often show v1):

curl -X POST http://localhost:9093/api/v2/silences \
  -H 'Content-Type: application/json' \
  -d '{
    "matchers": [{"name": "alertname", "value": "OllamaServiceDown", "isRegex": false}],
    "startsAt": "2026-04-12T10:00:00Z",
    "endsAt": "2026-04-12T12:00:00Z",
    "createdBy": "admin",
    "comment": "Scheduled maintenance"
  }'

Advanced LLM-Specific Monitoring Tools

Prometheus + Grafana is a general-purpose solution, but LLMs have special monitoring needs: Prompt tracing, Token costs, response quality evaluation. These metrics are hard to track with traditional monitoring tools.

Langfuse: LLM Tracing and Prompt Management

Langfuse is a monitoring platform designed specifically for LLM applications. It’s MIT-licensed open source and supports self-hosting. What it can do:

  • Trace every conversation: Record input Prompt, output content, Token count, duration
  • Prompt version management: Compare effects after Prompt changes
  • Quality evaluation: Record user feedback, manual annotations, track model output quality

Integration is straightforward—Langfuse has official Ollama support:

# Python integration example
from langfuse import Langfuse
import requests

langfuse = Langfuse(
    public_key="pk-xxx",
    secret_key="sk-xxx",
    host="https://cloud.langfuse.com"  # Or self-hosted address
)

# Record each call
trace = langfuse.trace(
    name="ollama-chat",
    input={"prompt": user_prompt},
    metadata={"model": "llama3.1"}
)

response = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.1", "prompt": user_prompt}
)

trace.update(
    output=response.json()["response"],
    metadata={"tokens": response.json().get("eval_count", 0)}
)

Deploy the self-hosted version with Docker:

services:
  langfuse-server:
    image: langfuse/langfuse:latest
    depends_on: [db]
    ports:
      - "3000:3000"
    environment:
      - DATABASE_URL=postgresql://user:pass@db:5432/langfuse
      - NEXTAUTH_URL=http://localhost:3000
      - NEXTAUTH_SECRET=your-secret
      - SALT=your-salt
  db:
    image: postgres:15
    environment:
      - POSTGRES_USER=user
      - POSTGRES_PASSWORD=pass
      - POSTGRES_DB=langfuse

If you use LangChain, integration is even simpler—Langfuse has an official callback handler.

SigNoz: OpenTelemetry Unified Monitoring

SigNoz is an OpenTelemetry-based observability platform that unifies logs, metrics, and traces. The benefit is you don’t need to maintain Prometheus, Jaeger, and ELK separately.

For LLM applications, SigNoz’s tracing is practical: you can see the complete chain from API entry to model inference to database queries for a single request.

Deploying SigNoz requires more resources—at least 4GB of RAM recommended. Official Docker Compose one-click deployment:

git clone https://github.com/SigNoz/signoz.git
cd signoz/deploy/docker
docker compose up -d

Tool Selection Recommendations

Here’s my recommendation for different scenarios:

| Scenario | Recommended Solution | Reason |
| --- | --- | --- |
| Small team (under 5 people) | Prometheus + Grafana | Simple and sufficient, rich community resources |
| Need Prompt tracing | Prometheus + Langfuse | Langfuse focuses on LLM, complementary |
| Enterprise multi-service | SigNoz + OpenTelemetry | Unified platform, lower ops cost |
| Pure cloud-native | Managed services | Saves ops effort |

I currently use the Prometheus + Grafana + Langfuse combination. Prometheus handles infrastructure metrics, Langfuse handles the LLM application layer—separate responsibilities, clear picture.

Final Thoughts

After all this, it comes down to one thing: Don’t wait for problems to think about monitoring.

That 3 AM lesson cost me a complete monitoring solution. Now my Ollama service has been running for over a year. I’ve encountered GPU memory alerts a few times, but they were all handled at the Warning level—never woken up in the middle of the night again.

The setup cost for this solution is actually low. I’ve organized all the configuration files—you can download and use them directly:

  • systemd service configuration
  • Docker Compose complete deployment (Ollama + Exporter + Prometheus + Grafana)
  • Prometheus alert rules
  • AlertManager configuration template
  • Grafana Dashboard JSON

The supporting GitHub repository is linked at the end of the article. Following this configuration, experienced users can get it running in 20 minutes, beginners in about 30.

Next steps I recommend:

  1. Start with basic Prometheus + Grafana to get metrics flowing
  2. Observe for 3-5 days to understand normal data ranges
  3. Adjust alert thresholds based on actual conditions
  4. Add Langfuse if you need Prompt tracing

Monitoring is an investment you make once with continuous returns. I hope you don’t have to learn this lesson the hard way at 3 AM like I did.


Configuration Repository: github.com/yourname/ollama-monitoring-config (example link, replace with actual deployment)

FAQ

What are the core metrics needed for Ollama production monitoring?
Four dimensions: service availability (process alive status), performance metrics (P95/P99 response latency), GPU memory utilization (prevent exhaustion), request error rate (track anomaly trends).
What is the difference between Prometheus + Grafana and Langfuse?
Prometheus + Grafana monitors infrastructure metrics (CPU, GPU, memory, request volume), while Langfuse focuses on the LLM application layer (Prompt tracing, Token costs, response quality evaluation). They are complementary—recommended to use together.
How should I set reasonable alert thresholds?
Recommend three tiers: Critical (GPU memory >95%, error rate >20%, service down) for immediate action; Warning (GPU memory >80%, error rate >5%) for review within 1 hour. The key is Critical alerts should be rare—when you see one, it should make you nervous.
What should I do if log files grow indefinitely in Docker deployment?
Add max-size: "100m" and max-file: "5" in the logging configuration of docker-compose.yml to limit each log file to 100MB and keep a maximum of 5 files, totaling no more than 500MB.
How do I monitor each GPU card separately in a multi-GPU environment?
The nvidia_gpu_prometheus_exporter metrics include a gpu_id label. In Grafana PromQL, use {{gpu_id}} as legendFormat to display each card separately.
How do I quickly diagnose issues when I receive an alert at 3 AM?
First check GPU memory curves (is it being consumed?), then check error rate trends (sudden spike or gradual degradation?), finally search journalctl logs for specific error messages. This sequence helps you identify the root cause within 10 minutes.

12 min read · Published on: Apr 12, 2026 · Modified on: Apr 12, 2026
