NVML Unknown Error in Docker
A common issue with NVIDIA GPU containers is the "Failed to initialize NVML: Unknown Error" message that appears after a container has been running for hours or days. This guide covers the root causes and effective solutions for maintaining stable GPU access in Docker containers.
Problem Overview
The "Failed to initialize NVML: Unknown Error" typically manifests as:
- GPU access works initially when starting a container
- After several hours or days,
nvidia-smi
fails with the error - Host machine continues to see GPUs correctly
- Restarting the container temporarily resolves the issue
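To confirm the symptom from the host, run nvidia-smi inside the affected container. The container name gpu_container below is a placeholder; substitute your own:
# Run nvidia-smi inside the suspect container
docker exec -it gpu_container nvidia-smi
# An affected container prints "Failed to initialize NVML: Unknown Error",
# while the same command on the host still succeeds.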
Root Causes
The primary cause is related to cgroup management and systemd interactions (a quick way to inspect your cgroup setup is shown after the list):
- systemd daemon-reload events, which re-evaluate unit files that reference GPU devices and can revoke the container's access to the GPU device nodes
- Cgroup version or driver mismatches between the host and the container runtime
- Driver/library version inconsistencies after system updates
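To see which cgroup setup you are running, the following sketch works on most systems (the docker info format fields assume Docker 20.10 or later):
# Check whether the host uses cgroup v1 or v2
stat -fc %T /sys/fs/cgroup/   # "cgroup2fs" means cgroup v2, "tmpfs" means cgroup v1
# Check which cgroup driver and version Docker reports
docker info --format '{{.CgroupDriver}} {{.CgroupVersion}}'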
Diagnostic Test
To confirm whether your issue is related to systemd daemon-reload:
# On the host machine, run:
sudo systemctl daemon-reload
If this immediately causes nvidia-smi to fail in your container, you are experiencing the cgroup/systemd compatibility issue.
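A fuller end-to-end reproduction, as a sketch (the container name is a placeholder, and the CUDA image tag is only an example; pick one that matches your driver):
# Start a throwaway GPU container
docker run -d --name nvml-test --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 sleep infinity
docker exec nvml-test nvidia-smi      # should succeed
sudo systemctl daemon-reload          # trigger the reload on the host
docker exec nvml-test nvidia-smi      # fails with the NVML error if you are affected
docker rm -f nvml-test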
Effective Solutions
Solution 1: Modify NVIDIA Container Runtime Configuration
Edit the NVIDIA container runtime configuration:
sudo nano /etc/nvidia-container-runtime/config.toml
# Change from:
# no-cgroups = true
# To:
no-cgroups = false
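If you prefer a non-interactive edit, a sed one-liner works too. This assumes the key is present exactly as no-cgroups = true; check the file first:
sudo sed -i 's/^no-cgroups = true/no-cgroups = false/' /etc/nvidia-container-runtime/config.toml
grep no-cgroups /etc/nvidia-container-runtime/config.toml   # verify the change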
After making this change, restart Docker:
sudo systemctl restart docker
Test the configuration:
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
Solution 2: Docker Daemon Configuration (Alternative)
Edit the Docker daemon configuration:
sudo nano /etc/docker/daemon.json
{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "args": []
    }
  },
  "exec-opts": ["native.cgroupdriver=cgroupfs"]
}
Restart Docker after making changes:
sudo systemctl restart docker
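To confirm that the new settings took effect (output wording varies slightly by Docker version):
docker info | grep -iE 'cgroup driver|runtimes'
# Expect the cgroup driver to be reported as cgroupfs and nvidia to appear among the runtimes.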
Solution 3: Kernel Upgrade (Long-term Fix)
Upgrade to Linux kernel version 4.5 or later to use cgroup v2:
# Check current kernel version
uname -r
# Update package lists and install available upgrades, including newer kernel packages (Ubuntu/Debian)
sudo apt update && sudo apt full-upgrade
WARNING
Kernel upgrades may require system reboot and could affect other system components. Test in a development environment first.
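On a recent enough kernel, cgroup v2 may still need to be enabled at boot. A sketch for GRUB-based Ubuntu/Debian systems; the kernel parameter is the standard systemd switch, but review your bootloader configuration before applying it:
sudo nano /etc/default/grub
# Append systemd.unified_cgroup_hierarchy=1 to GRUB_CMDLINE_LINUX_DEFAULT, e.g.:
# GRUB_CMDLINE_LINUX_DEFAULT="quiet splash systemd.unified_cgroup_hierarchy=1"
sudo update-grub
sudo reboot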
Solution 4: Health Check with Auto-Restart (Workaround)
For production systems where immediate fixes aren't possible, implement health checks. In a Dockerfile:
HEALTHCHECK \
--start-period=60s \
--interval=20s \
--timeout=10s \
--retries=2 \
CMD nvidia-smi || exit 1
Or the equivalent in a Docker Compose file:
services:
  gpu_container:
    healthcheck:
      test: ["CMD-SHELL", "nvidia-smi || exit 1"]
      start_period: 1s
      interval: 20s
      timeout: 5s
      retries: 2
    labels:
      - autoheal=true
Use the autoheal container to automatically restart unhealthy containers:
docker run -d \
--name autoheal \
--restart=always \
-e AUTOHEAL_CONTAINER_LABEL=all \
-v /var/run/docker.sock:/var/run/docker.sock \
willfarrell/autoheal
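Note that AUTOHEAL_CONTAINER_LABEL=all tells autoheal to watch every container with a failing health check; if you only want it to restart containers carrying the autoheal=true label shown above, drop that environment variable so the image's default label filter applies.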
Simple Troubleshooting Steps
Before implementing complex solutions, try these simple fixes:
- Restart the container - Often resolves temporary issues
- Reboot the host system - Clears driver/library mismatches
- Verify NVIDIA driver versions match between host and container
# Check host driver version
nvidia-smi --query-gpu=driver_version --format=csv,noheader
# Check driver visibility from inside a container (use a CUDA image tag compatible with your driver)
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
Prevention and Best Practices
- Keep systems updated - Regularly update NVIDIA drivers, Docker, and NVIDIA Container Toolkit
- Use compatible versions - Ensure host drivers and container base images are compatible
- Monitor systemd events - Be aware of automated system maintenance that might trigger daemon-reload
- Implement logging - Track when NVML failures occur to identify patterns (a sketch is shown after this list)
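As a starting point for the logging item above, a minimal cron-friendly sketch (the label filter and log path are placeholders; adapt them to your setup):
#!/bin/bash
# Append a timestamped entry whenever nvidia-smi fails inside a labelled GPU container.
LOGFILE=/var/log/nvml-health.log
for c in $(docker ps --filter label=autoheal=true --format '{{.Names}}'); do
    if ! docker exec "$c" nvidia-smi > /dev/null 2>&1; then
        echo "$(date -Is) NVML failure in container $c" >> "$LOGFILE"
    fi
done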
INFO
NVIDIA has acknowledged this issue and may provide official fixes in future releases of the NVIDIA Container Toolkit.
Conclusion
The "Failed to initialize NVML: Unknown Error" in Docker containers is typically resolved by addressing cgroup configuration issues. The most reliable solutions involve modifying the NVIDIA container runtime configuration or Docker daemon settings. For production environments, implementing health checks with auto-restart functionality provides a robust workaround while awaiting permanent fixes from NVIDIA.