
Docker Compose Production Deployment: Health Checks, Restart Policies, and Log Management

3 AM, and server alerts are blowing up your phone. You SSH in to find disk space at 99%—container logs are eating up 50GB.

That’s not even the worst part. Last year, a project’s API container showed “running” status, but the database connection had been dead for hours. Every request returned a 500 error. It took three hours to track down the issue. According to Last9’s research, container “zombie” states waste an average of 3.2 hours of troubleshooting time per incident.

Let’s be honest—many teams deploying Docker Compose to production just configure port mapping and volume mounts, then throw containers into the wild. No health checks, no log rotation, restart policy is just a lazy restart: always. The result: containers look like they’re running but are actually dead; log files grow uncontrollably until they fill the disk; crashing services restart infinitely, consuming all CPU and memory.

In this article, I’ll break down three core production configurations—health checks, restart policies, and log management—into clear, actionable steps. Not just configuration examples, but health check commands for common services, troubleshooting steps, and a complete docker-compose.yml template you can copy and use.

Health Checks — Make Containers Truly Alive

A container showing running status doesn’t mean the application is actually working. Database connection failures, ports not listening, frozen processes—Docker has no idea about these. Health checks act as a “heartbeat monitor” for containers, regularly checking whether the application can still respond normally.

Configuration Syntax

In docker-compose.yml, the healthcheck configuration looks like this:

healthcheck:
  test: ["CMD-SHELL", "pg_isready -U postgres"]
  interval: 10s      # Check every 10 seconds
  timeout: 5s        # Wait up to 5 seconds per check
  retries: 5         # Mark as unhealthy after 5 consecutive failures
  start_period: 30s  # Give container 30 seconds warm-up time after startup

These parameters need to work together. timeout can’t be larger than interval, otherwise the next check starts before the previous one finishes. start_period shouldn’t be skipped—services like databases start slowly, and if the warm-up time is too short, health checks will falsely mark the container as dead.

Health Check Commands for Common Services

Different services require different check methods. Here are some common ones:

PostgreSQL

healthcheck:
  test: ["CMD-SHELL", "pg_isready -U postgres -d mydb"]
  interval: 10s
  timeout: 5s
  retries: 5
  start_period: 30s

pg_isready is PostgreSQL’s built-in check tool, specifically designed to determine if the database is ready to accept connections.

MySQL / MariaDB

healthcheck:
  test: ["CMD-SHELL", "mysqladmin ping -h localhost -u root -p$$MYSQL_ROOT_PASSWORD"]
  interval: 10s
  timeout: 5s
  retries: 5
  start_period: 30s

Note the password uses $$ to escape the dollar sign—otherwise Compose treats $ as a variable interpolation reference.

Redis

healthcheck:
  test: ["CMD-SHELL", "redis-cli ping | grep PONG"]
  interval: 10s
  timeout: 3s
  retries: 3

Redis’s ping command returns PONG; piping through grep verifies the actual reply rather than just checking that redis-cli exited cleanly.

Web Server (HTTP Check)

healthcheck:
  test: ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 10s

The -f flag makes curl return a non-zero exit code on HTTP 4xx/5xx responses, which fails the health check.

Common Pitfall: Minimal images like Alpine may not have curl installed. Either install it (apk add curl) or use wget instead:

test: ["CMD-SHELL", "wget --no-verbose --tries=1 --spider http://localhost:8080/health || exit 1"]

Startup Order Control

The database isn’t ready yet, but the API container starts up, resulting in connection failures, errors, and crashes—I’ve seen this too many times. Using depends_on with condition: service_healthy solves this:

services:
  postgres:
    image: postgres:16
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 30s

  api:
    build: ./api
    depends_on:
      postgres:
        condition: service_healthy  # Wait for postgres health check to pass before starting

This way, Docker Compose waits for postgres’s health check to return healthy before starting the api container. No more awkward “database not ready, API tries to connect” situations.
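Relatedly, docker compose up (Compose v2) has a --wait flag that blocks until all services are running and healthy, which turns your health checks into a deployment gate for scripts and CI. A sketch (requires a running Docker daemon):

```shell
# Start the stack and block until every service with a healthcheck is healthy;
# the command exits non-zero if any service fails to become healthy,
# so a deploy script can abort on failure
docker compose up -d --wait
```
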

Restart Policies — Graceful Recovery After Failure

Container crashed—what to do? Automatic restart seems like a good idea. But here’s the problem: if the root cause isn’t resolved, restarting just creates an infinite loop, wasting CPU and memory, and masking the real failure.

Configuration Syntax

Restart policy is configured in the deploy block. One caveat up front: deploy.restart_policy is only fully honored by Docker Swarm (docker stack deploy). With plain docker compose up, the condition (and usually max_attempts) maps onto the engine’s restart policy, while delay and window are Swarm-only.

deploy:
  restart_policy:
    condition: on-failure   # Only restart on failure
    delay: 5s               # Wait 5 seconds before restart
    max_attempts: 3         # Maximum 3 restart attempts
    window: 120s            # 120 seconds without failure after restart counts as recovery

condition has three options:

  • none: Don’t restart, leave the container dead
  • on-failure: Only restart when container exits abnormally (non-zero exit code)
  • any: Restart regardless of the situation
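If you are not running Swarm, the same intent can be expressed with the top-level restart field, which the Docker engine itself enforces; current Compose versions accept an optional retry cap in the on-failure form. A sketch (service name illustrative):

```yaml
services:
  api:
    build: ./api
    # Engine-level policy: restart only on non-zero exit, at most 3 attempts
    restart: "on-failure:3"
```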

Production Recommendation

For production environments, use on-failure instead of always.

Why? restart: always makes containers restart no matter what. Application code has a bug causing crash? Restart. Database connection fails causing process exit? Restart. Configuration file error prevents startup? Still restart. The result is a crash loop, logs filling up, CPU being consumed repeatedly.

on-failure with max_attempts is different—restart at most 3 times, then stop if still failing. Operations can see the container ultimately died and investigate the real problem.

Parameter Tuning

delay is the restart interval. Too short, and the container might not be fully cleaned up before restarting; too long extends recovery time. Generally 5-10 seconds works well.

window is an easily overlooked parameter. It defines: how long after restart without another failure counts as successful restart. For example, setting window: 120s means if the container crashes again within 120 seconds after restart, the max_attempts counter doesn’t reset. This avoids false positives from “restarts successfully for one second then crashes again.”

Health Check and Restart Policy Coordination

Health checks and restart policies are related, but on a standalone Docker host they are not automatically chained:

  1. Health check fails retries times consecutively → Container marked as unhealthy
  2. On a standalone host, unhealthy is informational: it shows up in docker ps and gates depends_on, but it does not by itself trigger a restart
  3. Restart policies react to the container process exiting (per condition), not to health status
  4. Under Docker Swarm, an unhealthy task is stopped and replaced—that is where the full “fail → restart → recover” chain applies

To get auto-recovery from unhealthy states on a single host, have the application exit on fatal errors so on-failure can kick in, or run a watchdog container (the community autoheal image is one option) that restarts containers marked unhealthy.
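To watch health transitions as they happen, the engine’s event stream can be filtered. A sketch (requires a running Docker daemon; the container name is illustrative):

```shell
# Stream health_status events (healthy/unhealthy transitions) for all containers
docker events --filter event=health_status

# Or scope to one container, starting from the last hour
docker events --filter event=health_status \
  --filter container=api --since 1h
```
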

Log Management — Prevent Disk Space Exhaustion

The 3 AM alert I mentioned at the beginning—disk at 99%, logs consuming 50GB—I’ve experienced more than once. Docker’s default json-file log driver never cleans up old logs, so files grow indefinitely. Without log rotation configured, the disk fills up sooner or later.

Log Rotation Configuration

Add logging configuration in docker-compose.yml:

logging:
  driver: "json-file"
  options:
    max-size: "10m"      # Single log file max 10MB
    max-file: "3"        # Keep maximum 3 log files
    compress: "true"     # Compress old logs to save space

With this configuration, container logs occupy at most 30MB (10MB × 3). When exceeding 10MB, Docker creates a new file; when exceeding 3 files, the oldest gets deleted or compressed.
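These per-service options can also be set host-wide as daemon defaults. A sketch of /etc/docker/daemon.json—note it only affects containers created after the daemon restarts, and a per-service logging block still overrides it:

```json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
```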

Log files are stored at /var/lib/docker/containers/<container-id>/<container-id>-json.log. Use the du command to check actual usage:

du -sh /var/lib/docker/containers/*/*-json.log
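The same check can be wrapped into a small report script. An illustrative sketch—the directory is parameterized so it also works against a nonstandard Docker root:

```shell
#!/bin/sh
# Summarize json-file container log usage under a directory.
log_usage() {
    dir="${1:-/var/lib/docker/containers}"
    # du -ch prints per-file sizes plus a final "total" line
    find "$dir" -name '*-json.log' -exec du -ch {} + 2>/dev/null | sort -h
}

# Default to Docker's standard log location; override via LOG_DIR
log_usage "${LOG_DIR:-/var/lib/docker/containers}"
```
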

Driver Selection

Docker supports multiple log drivers: json-file, syslog, fluentd, journald, local, etc. For most scenarios, json-file or local are sufficient.

Docker’s official documentation mentions that the local driver is more efficient than json-file, with built-in log rotation, no need to manually configure max-size/max-file. If you have large log volumes (like tens of GB per day), consider using local:

logging:
  driver: "local"

However, the local driver has a drawback: it stores logs in a custom binary format, so external tools that read the JSON log files directly (as some log shippers do) can’t parse them. docker logs itself still works normally. A separate option, mode: "non-blocking", lets logging never block the application, at the cost of possibly dropping messages under load.

Centralized Log Collection (Optional)

Single-machine deployment works fine with json-file or local. But if you have dozens of servers and hundreds of containers, logs scattered everywhere become hard to manage. Consider centralized logging solutions:

  • Fluentd: Lightweight log collection, suitable for small clusters
  • ELK Stack (Elasticsearch + Logstash + Kibana): Powerful but high deployment cost
  • Loki + Grafana: Cloud-native solution, integrates well with Prometheus ecosystem

These solutions are more complex to configure and outside this article’s scope, but here’s a brief sketch of the Fluentd approach:

logging:
  driver: "fluentd"
  options:
    fluentd-address: "localhost:24224"
    tag: "docker.{{.Name}}"

Fluentd forwards logs to the specified address, where you can collect and analyze them on another server.

Complete Configuration Template

Combine health checks, restart policies, and log management to create a production-grade docker-compose.yml. Here’s a complete example with PostgreSQL database, Redis cache, and API service:

version: '3.8'

services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_USER: myuser
      POSTGRES_PASSWORD: mypassword
      POSTGRES_DB: mydb
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U myuser -d mydb"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 30s
    deploy:
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
        window: 120s
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"
        compress: "true"

  redis:
    image: redis:7-alpine
    healthcheck:
      test: ["CMD-SHELL", "redis-cli ping | grep PONG"]
      interval: 10s
      timeout: 3s
      retries: 3
      start_period: 5s
    deploy:
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"

  api:
    build:
      context: ./api
      dockerfile: Dockerfile
    ports:
      - "8080:8080"
    environment:
      DATABASE_URL: postgres://myuser:mypassword@postgres:5432/mydb
      REDIS_URL: redis://redis:6379
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 10s
    deploy:
      restart_policy:
        condition: on-failure
        delay: 10s
        max_attempts: 3
        window: 120s
    logging:
      driver: "json-file"
      options:
        max-size: "50m"
        max-file: "5"
        compress: "true"

volumes:
  postgres_data:

Configuration Key Points

Startup Order: The api container’s depends_on waits for both postgres and redis health checks to pass. Database and cache are ready before API starts, avoiding startup connection errors.

Log Size Differences: postgres and redis logs are usually small, 10MB × 3 is sufficient; API service logs may be larger, set to 50MB × 5. Adjust based on actual log volume, don’t use one size fits all.

Restart Delay Differences: postgres is the dependency everything else waits on, so its delay is a short 5 seconds to get it back quickly; the API uses delay: 10s, giving its dependencies extra time to recover before it reconnects.

Startup Warm-up Time: postgres gets start_period: 30s to allow enough initialization time; redis gets 5s because Redis starts almost instantly; the API gets 10s because application startup usually takes just a few seconds.

This template can be copied and used directly—just replace environment variables and images with your own. If your project has other services (like MongoDB, MinIO), add health checks, restart policies, and log configuration following the same pattern.
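For instance, a hedged sketch for MongoDB—this assumes the mongo:7 image, which ships the mongosh shell; adjust the image tag and timings to your workload:

```yaml
  mongo:
    image: mongo:7
    healthcheck:
      # db.adminCommand({ping: 1}).ok prints 1 once the server accepts commands
      test: ["CMD-SHELL", "mongosh --quiet --eval 'db.adminCommand({ping: 1}).ok' | grep -q 1"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 30s
    deploy:
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"
```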

Common Pitfalls and Troubleshooting

Configuration written, deployed, problems may still appear. Here are common pitfalls and troubleshooting steps.

Health Check Keeps Failing

Symptom: Container status always shows unhealthy, but application seems to work normally.

Troubleshooting Steps:

  1. First check if the health check command tool exists:

    docker exec <container> which curl
    docker exec <container> which pg_isready

    Alpine images often don’t have curl, need to manually install or switch to wget.

  2. Manually run health check command, check output:

    docker exec <container> curl -f http://localhost:8080/health

    If it returns an error, the health check endpoint itself may have issues.

  3. View detailed health check status:

    docker inspect --format='{{json .State.Health}}' <container> | jq

    You can see recent check results, failure reasons, timestamps.

Container Restarting Repeatedly

Symptom: Container starts and dies after a few seconds, logs filled with restart records.

Troubleshooting Steps:

  1. Check container exit reason:

    docker inspect --format='{{.State.ExitCode}}' <container>
    docker inspect --format='{{.State.Error}}' <container>

    Exit code tells you roughly what’s wrong (1 = general error, 137 = SIGKILL, usually the OOM killer, 139 = segmentation fault).

  2. Check restart count:

    docker inspect --format='{{.RestartCount}}' <container>

    If count is large, check if max_attempts is taking effect.

  3. Check container logs for specific errors:

    docker logs --tail 100 <container>

Log Disk Full

Symptom: Disk space alert, discover /var/lib/docker/containers directory is very large.

Troubleshooting Steps:

  1. Find largest log files:

    du -sh /var/lib/docker/containers/*/*-json.log | sort -rh | head -5

  2. Check if log rotation configuration is effective:

    docker inspect --format='{{.HostConfig.LogConfig}}' <container>

    If the options come back empty (something like {json-file map[]}), log rotation isn’t configured.

  3. Manually clean logs (temporary solution):

    truncate -s 0 /var/lib/docker/containers/<id>/<id>-json.log

    This is a temporary solution, long-term still need to add log rotation configuration.

Quick Troubleshooting Command List

When problems occur, these commands help quickly locate issues:

# View all container health status
docker ps --format "table {{.Names}}\t{{.Status}}"

# View specific container's health check history
docker inspect --format='{{json .State.Health}}' <container>

# View container exit code and restart count
docker inspect --format='ExitCode: {{.State.ExitCode}}, RestartCount: {{.RestartCount}}' <container>

# Check log file sizes
du -sh /var/lib/docker/containers/*/*-json.log | sort -rh

# View container's last 100 log lines
docker logs --tail 100 <container>

Summary

For production deployment with Docker Compose, these three configurations aren’t optional—they’re essential: health checks make containers not just “look alive”, restart policies give failures a chance for auto-recovery while limiting infinite loops, log management prevents disk exhaustion.


Core Configuration Checklist:

  • Health Check: test + interval + timeout + retries + start_period
  • Restart Policy: condition: on-failure + max_attempts: 3
  • Log Rotation: max-size: 10m + max-file: 3 + compress: true

Three-Step Action Plan:

  1. Check your existing docker-compose.yml for these three configurations. If missing, at least add health checks and log rotation.
  2. Deploy a test service using the complete template above, observe if health checks work and logs are rotating.
  3. Save the troubleshooting commands. Next time you get a 3 AM alert, you can quickly locate the problem.

Don’t let your containers run naked in production. Configure these three “protective shields”—when problems occur, at least they can auto-recover, be diagnosed quickly, and won’t fill up your disk.

FAQ

How should I configure health check interval and timeout?
Interval should be 10-30 seconds, timeout 3-10 seconds. The key is timeout shouldn't exceed interval, otherwise the next check starts before the previous one finishes. Database services should have longer start_period (30-60 seconds) to allow initialization time.
Is restart: always or on-failure better for restart policy?
Production environments recommend on-failure paired with max_attempts. always restarts in any situation, including config errors and code bugs, causing crash loops. on-failure only restarts on abnormal exits, combined with max_attempts limit lets operations discover problems.
What log file size and count are appropriate?
Generally services should use max-size: 10m + max-file: 3, totaling 30MB. High-volume services (like APIs) can use 50m × 5. The key is adjusting based on actual log volume while enabling compress: true to save space.
Container health check keeps failing but application runs normally—what to do?
First confirm the health check command tool exists (curl, pg_isready, etc.), Alpine images often lack tools. Then manually run the check command to see output, finally use docker inspect to view health check history and locate the specific failure reason.
What does depends_on with condition: service_healthy do?
Ensures dependency service's health check passes before starting the current container. More reliable than simple depends_on, avoids API trying to connect before database is ready causing startup failure. Requires dependency service to have healthcheck configured.
How to quickly check how much disk space container logs use?
Use the command: du -sh /var/lib/docker/containers/*/*-json.log | sort -rh. If you find a single container's logs are particularly large, check if that container's log rotation configuration is effective.

10 min read · Published on: Apr 12, 2026 · Modified on: Apr 12, 2026
