NVML Unknown Error in Docker
A common issue with NVIDIA GPU containers is the "Failed to initialize NVML: Unknown Error" message that appears after a container has been running for hours or days. This guide covers the root causes and effective solutions for maintaining stable GPU access in Docker containers.
Problem Overview
The "Failed to initialize NVML: Unknown Error" typically manifests as:
- GPU access works initially when starting a container
- After several hours or days,
nvidia-smi
fails with the error - Host machine continues to see GPUs correctly
- Restarting the container temporarily resolves the issue
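To confirm the symptom from the host, run nvidia-smi inside the affected container. The container name gpu_container below is a placeholder; substitute your own:
# Run nvidia-smi inside the suspect container
docker exec -it gpu_container nvidia-smi
# An affected container prints "Failed to initialize NVML: Unknown Error",
# while the same command on the host still succeeds.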
Root Causes
The primary cause is related to cgroup management and systemd interactions (a quick way to inspect your cgroup setup is shown after the list):
- systemd daemon-reload events, which re-evaluate unit files that reference GPU devices and can revoke the container's access to the GPU device nodes
- Cgroup version or driver mismatches between the host and the container runtime
- Driver/library version inconsistencies after system updates
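To see which cgroup setup you are running, the following sketch works on most systems (the docker info format fields assume Docker 20.10 or later):
# Check whether the host uses cgroup v1 or v2
stat -fc %T /sys/fs/cgroup/   # "cgroup2fs" means cgroup v2, "tmpfs" means cgroup v1
# Check which cgroup driver and version Docker reports
docker info --format '{{.CgroupDriver}} {{.CgroupVersion}}'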
Diagnostic Test
To confirm whether your issue is related to systemd daemon-reload:
# On the host machine, run:
sudo systemctl daemon-reload
If this immediately causes nvidia-smi to fail in your container, you are experiencing the cgroup/systemd compatibility issue.
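A fuller end-to-end reproduction, as a sketch (the container name is a placeholder, and the CUDA image tag is only an example; pick one that matches your driver):
# Start a throwaway GPU container
docker run -d --name nvml-test --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 sleep infinity
docker exec nvml-test nvidia-smi      # should succeed
sudo systemctl daemon-reload          # trigger the reload on the host
docker exec nvml-test nvidia-smi      # fails with the NVML error if you are affected
docker rm -f nvml-test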
Effective Solutions
Solution 1: Modify NVIDIA Container Runtime Configuration
Edit the NVIDIA container runtime configuration:
sudo nano /etc/nvidia-container-runtime/config.toml
# Change from:
# no-cgroups = true
# To:
no-cgroups = false
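If you prefer a non-interactive edit, a sed one-liner works too. This assumes the key is present exactly as no-cgroups = true; check the file first:
sudo sed -i 's/^no-cgroups = true/no-cgroups = false/' /etc/nvidia-container-runtime/config.toml
grep no-cgroups /etc/nvidia-container-runtime/config.toml   # verify the change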
After making this change, restart Docker:
sudo systemctl restart docker
Test the configuration:
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
Solution 2: Docker Daemon Configuration (Alternative)
Edit the Docker daemon configuration:
sudo nano /etc/docker/daemon.json
{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "args": []
    }
  },
  "exec-opts": ["native.cgroupdriver=cgroupfs"]
}
Restart Docker after making changes:
sudo systemctl restart docker
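To confirm that the new settings took effect (output wording varies slightly by Docker version):
docker info | grep -iE 'cgroup driver|runtimes'
# Expect the cgroup driver to be reported as cgroupfs and nvidia to appear among the runtimes.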
Solution 3: Kernel Upgrade (Long-term Fix)
Upgrade to Linux kernel version 4.5 or later to use cgroup v2:
# Check current kernel version
uname -r
# Update package lists and install available upgrades, including newer kernel packages (Ubuntu/Debian)
sudo apt update && sudo apt full-upgrade
WARNING
Kernel upgrades may require system reboot and could affect other system components. Test in a development environment first.
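On a recent enough kernel, cgroup v2 may still need to be enabled at boot. A sketch for GRUB-based Ubuntu/Debian systems; the kernel parameter is the standard systemd switch, but review your bootloader configuration before applying it:
sudo nano /etc/default/grub
# Append systemd.unified_cgroup_hierarchy=1 to GRUB_CMDLINE_LINUX_DEFAULT, e.g.:
# GRUB_CMDLINE_LINUX_DEFAULT="quiet splash systemd.unified_cgroup_hierarchy=1"
sudo update-grub
sudo reboot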
Solution 4: Health Check with Auto-Restart (Workaround)
For production systems where immediate fixes aren't possible, implement health checks. In a Dockerfile:
HEALTHCHECK \
--start-period=60s \
--interval=20s \
--timeout=10s \
--retries=2 \
CMD nvidia-smi || exit 1
Or the equivalent in a Docker Compose file:
services:
  gpu_container:
    healthcheck:
      test: ["CMD-SHELL", "nvidia-smi || exit 1"]
      start_period: 1s
      interval: 20s
      timeout: 5s
      retries: 2
    labels:
      - autoheal=true
Use the autoheal container to automatically restart unhealthy containers:
docker run -d \
--name autoheal \
--restart=always \
-e AUTOHEAL_CONTAINER_LABEL=all \
-v /var/run/docker.sock:/var/run/docker.sock \
willfarrell/autoheal
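Note that AUTOHEAL_CONTAINER_LABEL=all tells autoheal to watch every container with a failing health check; if you only want it to restart containers carrying the autoheal=true label shown above, drop that environment variable so the image's default label filter applies.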
Simple Troubleshooting Steps
Before implementing complex solutions, try these simple fixes:
- Restart the container - Often resolves temporary issues
- Reboot the host system - Clears driver/library mismatches
- Verify NVIDIA driver versions match between host and container
# Check host driver version
nvidia-smi --query-gpu=driver_version --format=csv,noheader
# Check driver visibility from inside a container (use a CUDA image tag compatible with your driver)
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
Prevention and Best Practices
- Keep systems updated - Regularly update NVIDIA drivers, Docker, and NVIDIA Container Toolkit
- Use compatible versions - Ensure host drivers and container base images are compatible
- Monitor systemd events - Be aware of automated system maintenance that might trigger daemon-reload
- Implement logging - Track when NVML failures occur to identify patterns (a sketch is shown after this list)
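As a starting point for the logging item above, a minimal cron-friendly sketch (the label filter and log path are placeholders; adapt them to your setup):
#!/bin/bash
# Append a timestamped entry whenever nvidia-smi fails inside a labelled GPU container.
LOGFILE=/var/log/nvml-health.log
for c in $(docker ps --filter label=autoheal=true --format '{{.Names}}'); do
    if ! docker exec "$c" nvidia-smi > /dev/null 2>&1; then
        echo "$(date -Is) NVML failure in container $c" >> "$LOGFILE"
    fi
done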
INFO
NVIDIA has acknowledged this issue and may provide official fixes in future releases of the NVIDIA Container Toolkit.
Conclusion
The "Failed to initialize NVML: Unknown Error" in Docker containers is typically resolved by addressing cgroup configuration issues. The most reliable solutions involve modifying the NVIDIA container runtime configuration or Docker daemon settings. For production environments, implementing health checks with auto-restart functionality provides a robust workaround while awaiting permanent fixes from NVIDIA.