CUDA Out of Memory: Resolving PyTorch Memory Fragmentation
Problem Statement
When training PyTorch models on CUDA devices, you may encounter this memory allocation error:
```
RuntimeError: CUDA out of memory. Tried to allocate 8.00 GiB (GPU 0;
15.90 GiB total capacity; 12.04 GiB already allocated; 2.72 GiB free;
12.27 GiB reserved in total by PyTorch) If reserved memory is >> allocated
memory try setting max_split_size_mb to avoid fragmentation.
```

This occurs when PyTorch's caching memory allocator suffers from fragmentation: it reserves more memory than is actually allocated because free blocks get split into pieces too small to satisfy new requests. Common triggers include:
- Using high-resolution images or complex models
- Insufficient memory cleanup between training runs
- Suboptimal memory allocation strategies
- Multi-GPU training setups
WARNING
This error often persists even after reducing batch sizes, which makes max_split_size_mb adjustments crucial for memory-intensive workflows.
Solutions
Setting max_split_size_mb via Environment Variable
The most common fix is to configure PyTorch's memory allocator with max_split_size_mb. This setting stops the allocator from splitting memory blocks larger than the given size (in MB), keeping large free blocks intact and reducing fragmentation.
On Linux/macOS (bash):

```bash
export PYTORCH_CUDA_ALLOC_CONF="max_split_size_mb:512"
```

On Windows (PowerShell):

```powershell
$env:PYTORCH_CUDA_ALLOC_CONF="max_split_size_mb:512"
```

On Windows (cmd):

```cmd
set PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
```

Within Python scripts:
```python
import os

# Set before initializing any CUDA operations
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

# Rest of your PyTorch code follows
```

TIP
Optimal values for max_split_size_mb vary:
- Start with 512 (recommended default)
- Increase if out-of-memory errors persist (try 1024, 2048)
- Decrease (256, 128) if performance degrades significantly
- Use PyTorch's diagnostic tools:

```python
print(torch.cuda.memory_summary())
print(torch.cuda.memory_stats())
```
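As a rough check on whether fragmentation is the culprit, you can compare how much memory the allocator has reserved against how much is actually allocated. The sketch below uses the standard torch.cuda statistics; the 1.5x threshold is only an illustrative assumption, not a PyTorch recommendation:

```python
import torch

def fragmentation_report(device=0):
    # Memory currently held by live tensors
    allocated = torch.cuda.memory_allocated(device)
    # Memory reserved by PyTorch's caching allocator
    reserved = torch.cuda.memory_reserved(device)
    print(f"allocated: {allocated / 1e9:.2f} GB, reserved: {reserved / 1e9:.2f} GB")
    # Reserved far exceeding allocated suggests fragmentation;
    # the 1.5x threshold is an illustrative assumption
    if allocated and reserved / allocated > 1.5:
        print("Reserved >> allocated: consider tuning max_split_size_mb")

fragmentation_report()
```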
Freeing Reserved Memory with Cache Clear
Force PyTorch to release reserved memory:
```python
import torch

# Clear cache before critical operations
torch.cuda.empty_cache()

# Especially useful between training epochs
# or before large memory allocations
```

WARNING

empty_cache() has computational overhead; overuse can degrade performance.
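A minimal sketch of where a cache clear might sit in a training loop; model, loader, optimizer, and loss_fn are placeholders assumed to exist. Clearing once per epoch rather than per batch keeps the overhead manageable:

```python
import gc
import torch

def train(model, loader, optimizer, loss_fn, epochs):
    for epoch in range(epochs):
        for inputs, targets in loader:
            inputs, targets = inputs.cuda(), targets.cuda()
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), targets)
            loss.backward()
            optimizer.step()
        # Drop dangling Python references, then release cached blocks
        # once per epoch rather than every iteration
        gc.collect()
        torch.cuda.empty_cache()
```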
Handling Multi-GPU Memory Issues
For distributed training errors when using torch.distributed.launch:
```python
import argparse
import torch

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int)  # Added argument
args = parser.parse_args()

# Set correct device for current process
torch.cuda.set_device(args.local_rank)  # Critical fix

# Initialize your model and data here
model = YourModel().cuda()
```
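For context, a minimal sketch of how this device pinning typically fits together with process-group initialization and DistributedDataParallel, assuming the script is started by torch.distributed.launch (which sets the rank and world-size environment variables); YourModel is a placeholder. The launch commands below stay the same:

```python
import argparse
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

# Pin this process to its GPU before creating any CUDA tensors
torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend="nccl")

model = YourModel().cuda()  # YourModel is a placeholder
# Wrap the model so gradients are synchronized across processes
model = DDP(model, device_ids=[args.local_rank])
```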
```bash
# Without fix (causes error)
python -m torch.distributed.launch --nproc_per_node=4 train.py

# Run after adding set_device() in train.py
python -m torch.distributed.launch --nproc_per_node=4 \
    train.py <your_arguments>
```

Additional Considerations
Hardware-Specific Notes
- NVIDIA GTX 16XX Series: Install driver v531+ for improved VRAM management
- Stable Diffusion Users: Combine with the --medvram or --lowvram flags
Prevention Strategies
- Reduce input data dimensions
- Delete unused variables with del followed by gc.collect()
- Use AMP (Automatic Mixed Precision):

```python
scaler = torch.cuda.amp.GradScaler()

with torch.cuda.amp.autocast():
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```
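For context, a sketch of how those AMP calls are usually arranged inside a training loop (model, loader, optimizer, and loss_fn are assumed to exist). Zeroing gradients with set_to_none=True also releases the gradient tensors between steps, which saves additional memory:

```python
import torch

scaler = torch.cuda.amp.GradScaler()

for inputs, targets in loader:  # loader, model, optimizer, loss_fn assumed to exist
    inputs, targets = inputs.cuda(), targets.cuda()
    # set_to_none=True frees gradient tensors instead of zero-filling them
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():
        outputs = model(inputs)
        loss = loss_fn(outputs, targets)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```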
Key Takeaways
- Set max_split_size_mb as a first fix for fragmentation
- Use torch.cuda.empty_cache() for memory-critical operations
- Always pin GPU devices for distributed training workflows
- Update GPU drivers for newer architectures