NVML Unknown Error in Docker

A common issue with NVIDIA GPU containers is the "Failed to initialize NVML: Unknown Error" that occurs after containers have been running for hours or days. This guide covers the root causes and effective solutions for maintaining stable GPU access in Docker containers.

Problem Overview

The "Failed to initialize NVML: Unknown Error" typically manifests as:

  • GPU access works initially when starting a container
  • After several hours or days, nvidia-smi inside the container fails with the error
  • Host machine continues to see GPUs correctly
  • Restarting the container temporarily resolves the issue
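
A quick way to confirm that a running container is affected (a sketch; gpu_container is a placeholder for your container's name):

bash
# an affected container prints "Failed to initialize NVML: Unknown Error"
docker exec gpu_container nvidia-smi

# while the same command on the host still works
nvidia-smi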

Root Causes

The error usually stems from the interaction between the NVIDIA container runtime, cgroups, and systemd:

  1. Systemd daemon-reload events, which re-apply cgroup device rules from unit files and can strip a running container's access to the NVIDIA device nodes
  2. Cgroup version and driver mismatches between the host and the container runtime (a quick check follows this list)
  3. Driver/library version inconsistencies after system updates
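
To see which cgroup setup you are running, the following checks are usually enough (a sketch, assuming Docker 20.10 or later, which reports its cgroup driver and version):

bash
# cgroup2fs indicates cgroup v2; tmpfs indicates cgroup v1
stat -fc %T /sys/fs/cgroup/

# report Docker's cgroup driver (systemd or cgroupfs) and cgroup version
docker info --format 'driver={{.CgroupDriver}} version={{.CgroupVersion}}'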

Diagnostic Test

To confirm whether your issue is triggered by systemd daemon-reload:

bash
# On the host machine, run:
sudo systemctl daemon-reload

If this immediately causes nvidia-smi to fail in your container, you're experiencing the cgroup/systemd compatibility issue.
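
To watch the failure happen in real time, poll nvidia-smi inside the container while triggering the reload from another shell (gpu_container is a placeholder name):

bash
# terminal 1: poll nvidia-smi inside the container every 2 seconds
watch -n 2 docker exec gpu_container nvidia-smi

# terminal 2, on the host: trigger the reload
sudo systemctl daemon-reload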

Effective Solutions

Solution 1: Modify NVIDIA Container Runtime Configuration

bash
sudo nano /etc/nvidia-container-runtime/config.toml

Locate the no-cgroups option and set it to false:

toml
# Change from:
# no-cgroups = true

# To:
no-cgroups = false
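
The same change can be made non-interactively; a sketch (review the file afterwards, since the exact formatting of config.toml varies between toolkit versions):

bash
# set no-cgroups = false, uncommenting the line if needed (GNU sed)
sudo sed -i 's/^#\?\s*no-cgroups\s*=.*/no-cgroups = false/' /etc/nvidia-container-runtime/config.toml

# confirm the result
grep -n "no-cgroups" /etc/nvidia-container-runtime/config.toml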

After making this change, restart Docker:

bash
sudo systemctl restart docker

Test the configuration:

bash
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
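
Because the one-shot test above exits immediately, it will not catch the daemon-reload failure mode. A longer-lived check repeats the diagnostic against a running container (the name nvml-test is arbitrary):

bash
# start a long-lived GPU container
sudo docker run -d --name nvml-test --runtime=nvidia --gpus all ubuntu sleep infinity

# trigger the event that previously broke NVML
sudo systemctl daemon-reload

# GPU access should still work inside the running container
sudo docker exec nvml-test nvidia-smi

# clean up
sudo docker rm -f nvml-test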

Solution 2: Docker Daemon Configuration (Alternative)

Edit the Docker daemon configuration:

bash
sudo nano /etc/docker/daemon.json

Merge the following settings into the file, keeping any existing keys:

json
{
  "runtimes": {
    "nvidia": {
      "args": [],
      "path": "nvidia-container-runtime"
    }
  },
  "exec-opts": ["native.cgroupdriver=cgroupfs"]
}

Restart Docker after making changes:

bash
sudo service docker restart
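
After the restart, confirm that the daemon picked up the new driver (python3 is used here only as a convenient JSON syntax checker in case the daemon refuses to start):

bash
# if Docker fails to restart, check the JSON syntax first
python3 -m json.tool /etc/docker/daemon.json

# the daemon should now report the cgroupfs driver
docker info --format '{{.CgroupDriver}}'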

Solution 3: Kernel Upgrade (Long-term Fix)

Upgrade to a kernel with cgroup v2 support (Linux 4.5 or later) and make sure the system boots with cgroup v2 enabled:

bash
# Check current kernel version
uname -r

# Update and upgrade (Ubuntu/Debian)
sudo apt update && sudo apt upgrade
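
A newer kernel only provides cgroup v2 support; many distributions still boot with cgroup v1 unless the unified hierarchy is enabled. A sketch of the follow-up checks (the GRUB path below is the Ubuntu/Debian default and may differ on your system):

bash
# after rebooting into the new kernel, verify that cgroup v2 is mounted
mount | grep cgroup2

# if it is not, the unified hierarchy can usually be enabled by adding
# systemd.unified_cgroup_hierarchy=1 to GRUB_CMDLINE_LINUX in /etc/default/grub,
# then regenerating the GRUB configuration and rebooting
sudo update-grub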

WARNING

Kernel upgrades may require system reboot and could affect other system components. Test in a development environment first.

Solution 4: Health Check with Auto-Restart (Workaround)

For production systems where the fixes above can't be applied immediately, add a health check that fails whenever NVML breaks. In a Dockerfile:

dockerfile
HEALTHCHECK \
    --start-period=60s \
    --interval=20s \
    --timeout=10s \
    --retries=2 \
    CMD nvidia-smi || exit 1

The corresponding health check in a Docker Compose file, together with the autoheal label used below:

yaml
services:
  gpu_container:
    healthcheck:
      test: ["CMD-SHELL", "nvidia-smi || exit 1"]
      start_period: 1s
      interval: 20s
      timeout: 5s
      retries: 2
    labels:
      - autoheal=true

Use the autoheal container to automatically restart containers whose health check fails. By default, autoheal only watches containers labeled autoheal=true; setting AUTOHEAL_CONTAINER_LABEL=all, as below, makes it watch every container instead:

bash
docker run -d \
    --name autoheal \
    --restart=always \
    -e AUTOHEAL_CONTAINER_LABEL=all \
    -v /var/run/docker.sock:/var/run/docker.sock \
    willfarrell/autoheal
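
To verify that the health check and autoheal are wired up correctly, inspect the container's health status (replace gpu_container with the actual name shown by docker ps):

bash
# list containers with their status, including (healthy)/(unhealthy)
docker ps --format '{{.Names}}\t{{.Status}}'

# inspect a single container's health state
docker inspect --format '{{.State.Health.Status}}' gpu_container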

Simple Troubleshooting Steps

Before implementing complex solutions, try these simple fixes:

  1. Restart the container - Often resolves temporary issues
  2. Reboot the host system - Clears driver/library mismatches
  3. Verify NVIDIA driver versions match between host and container (see the comparison sketch below)
bash
# Check host driver version
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# Check the driver version visible inside a container (older shorthand tags
# such as 11.0-base may have been removed from Docker Hub; substitute a
# current CUDA image tag if the pull fails)
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
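
The two values should match, because the container runtime mounts the host's driver libraries into the container. A small comparison sketch (the CUDA image tag is an assumption; substitute one available for your setup):

bash
CUDA_IMAGE=nvidia/cuda:12.2.0-base-ubuntu22.04   # assumed tag, adjust as needed
HOST_VER=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)
CTR_VER=$(docker run --rm --gpus all "$CUDA_IMAGE" nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)
echo "host=$HOST_VER container=$CTR_VER"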

Prevention and Best Practices

  1. Keep systems updated - Regularly update NVIDIA drivers, Docker, and NVIDIA Container Toolkit
  2. Use compatible versions - Ensure host drivers and container base images are compatible
  3. Monitor systemd events - Be aware of automated system maintenance that might trigger daemon-reload
  4. Implement logging - Track when NVML failures occur to identify patterns (a minimal logging loop is sketched below)
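
For point 4, even a simple loop on the host is enough to build a timeline of failures (a sketch; the container name and log path are placeholders):

bash
# append a timestamped entry whenever nvidia-smi fails inside the container
while true; do
    if ! docker exec gpu_container nvidia-smi > /dev/null 2>&1; then
        echo "$(date --iso-8601=seconds) NVML failure in gpu_container" >> /var/log/nvml-failures.log
    fi
    sleep 60
done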

INFO

NVIDIA has acknowledged this issue and may provide official fixes in future releases of the NVIDIA Container Toolkit.

Conclusion

The "Failed to initialize NVML: Unknown Error" in Docker containers is typically resolved by addressing cgroup configuration issues. The most reliable solutions involve modifying the NVIDIA container runtime configuration or Docker daemon settings. For production environments, implementing health checks with auto-restart functionality provides a robust workaround while awaiting permanent fixes from NVIDIA.