CUDA Out of Memory: Resolving PyTorch Memory Fragmentation
Problem Statement
When training PyTorch models on CUDA devices, you may encounter this memory allocation error:
RuntimeError: CUDA out of memory. Tried to allocate 8.00 GiB (GPU 0;
15.90 GiB total capacity; 12.04 GiB already allocated; 2.72 GiB free;
12.27 GiB reserved in total by PyTorch) If reserved memory is >> allocated
memory try setting max_split_size_mb to avoid fragmentation.
This occurs when PyTorch's caching allocator fragments GPU memory: it reserves more memory than is actually allocated because free blocks have been split into pieces too small to satisfy new requests. Common triggers include:
- Using high-resolution images or complex models
- Insufficient memory cleanup between training runs
- Suboptimal memory allocation strategies
- Multi-GPU training setups
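To confirm that the failure stems from fragmentation rather than genuine exhaustion, compare how much memory PyTorch has reserved with how much is actually held by tensors. Below is a minimal diagnostic sketch; the 1 GiB threshold is an arbitrary illustration, not a PyTorch-defined limit:
import torch

def report_fragmentation(device: int = 0) -> None:
    # Bytes currently held by live tensors
    allocated = torch.cuda.memory_allocated(device)
    # Bytes reserved by the caching allocator (live tensors + cached blocks)
    reserved = torch.cuda.memory_reserved(device)
    gap_gib = (reserved - allocated) / 1024 ** 3
    print(f"allocated {allocated / 1024 ** 3:.2f} GiB | "
          f"reserved {reserved / 1024 ** 3:.2f} GiB | gap {gap_gib:.2f} GiB")
    if gap_gib > 1.0:  # arbitrary threshold, purely illustrative
        print("Reserved far exceeds allocated: likely fragmentation, "
              "consider tuning max_split_size_mb")

if torch.cuda.is_available():
    report_fragmentation()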
WARNING
This error often persists even after reducing batch sizes, making max_split_size_mb adjustments crucial for memory-intensive workflows.
Solutions
Setting max_split_size_mb via Environment Variable
The most common fix is to configure PyTorch's memory allocator with max_split_size_mb. This prevents the allocator from splitting cached blocks larger than the specified size (in MB), keeping large blocks intact and reducing fragmentation.
Linux/macOS (bash):
export PYTORCH_CUDA_ALLOC_CONF="max_split_size_mb:512"
Windows PowerShell:
$env:PYTORCH_CUDA_ALLOC_CONF="max_split_size_mb:512"
Windows Command Prompt (cmd):
set PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
Within Python scripts:
import os
# Set before initializing any CUDA operations
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"
# Rest of your PyTorch code follows
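The setting only takes effect if it is applied before the CUDA context is created. A hedged sketch that checks for this; it also shows that multiple allocator options can be combined with commas (garbage_collection_threshold is available in recent PyTorch releases, and the 0.8 value is illustrative, not tuned):
import os
import torch

if torch.cuda.is_initialized():
    # Too late: the allocator has already read its configuration
    print("Warning: CUDA already initialized; allocator config will not apply")
else:
    # Multiple options are comma-separated
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = (
        "max_split_size_mb:512,garbage_collection_threshold:0.8"
    )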
TIP
Optimal values for max_split_size_mb vary:
- Start with 512 (a reasonable default)
- Decrease (256, 128) if out-of-memory errors persist; smaller values protect more large blocks from being split
- Increase (1024, 2048) if performance degrades significantly
- Use PyTorch's diagnostic tools:
print(torch.cuda.memory_summary())
print(torch.cuda.memory_stats())
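Beyond the human-readable summary, torch.cuda.memory_stats() returns a dict of counters. A short sketch pulling out the ones most relevant here; .get() is used because exact key availability can vary across PyTorch versions:
import torch

stats = torch.cuda.memory_stats()
# Non-zero retry/OOM counters mean the allocator had to flush its cache;
# a large reserved-vs-allocated gap points to fragmentation.
print("alloc retries :", stats.get("num_alloc_retries", 0))
print("OOM events    :", stats.get("num_ooms", 0))
print("allocated GiB :", stats.get("allocated_bytes.all.current", 0) / 1024 ** 3)
print("reserved GiB  :", stats.get("reserved_bytes.all.current", 0) / 1024 ** 3)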
Freeing Reserved Memory with Cache Clear
Force PyTorch to release reserved memory:
import torch
# Clear cache before critical operations
torch.cuda.empty_cache()
# Especially useful between training epochs
# or before large memory allocations
WARNING
empty_cache() has computational overhead; overusing it can degrade performance.
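In practice the cache clear is combined with dropping references to large intermediates first, since reserved blocks can only be released once nothing points to them. A minimal per-epoch cleanup sketch; model, optimizer, loss_fn, train_loader, and num_epochs are hypothetical placeholders supplied by the caller:
import gc
import torch

def train_with_epoch_cleanup(model, optimizer, loss_fn, train_loader, num_epochs):
    for epoch in range(num_epochs):
        model.train()
        for inputs, targets in train_loader:
            inputs, targets = inputs.cuda(), targets.cuda()
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = loss_fn(outputs, targets)
            loss.backward()
            optimizer.step()
        # Drop references to the last batch's tensors, then reclaim them
        del inputs, targets, outputs, loss
        gc.collect()
        # One cache clear per epoch releases reserved blocks without paying
        # empty_cache() overhead on every iteration
        torch.cuda.empty_cache()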
Handling Multi-GPU Memory Issues
For distributed training errors when using torch.distributed.launch:
import argparse
import torch
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int) # Added argument
args = parser.parse_args()
# Set correct device for current process
torch.cuda.set_device(args.local_rank) # Critical fix
# Initialize your model and data here
model = YourModel().cuda()
# Launch command (fails with device placement errors before the fix)
python -m torch.distributed.launch --nproc_per_node=4 train.py
# Same command succeeds once train.py calls set_device()
python -m torch.distributed.launch --nproc_per_node=4 \
    train.py <your_arguments>
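Pinning the device is typically followed by initializing the process group and wrapping the model in DistributedDataParallel, so each rank keeps its parameters and gradient buffers on its own GPU instead of defaulting to cuda:0. A minimal sketch; the helper name is a placeholder and assumes torch.distributed.launch has set the rendezvous environment variables:
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_ddp_model(model: torch.nn.Module, local_rank: int) -> DDP:
    # Pin this process to its own GPU before any CUDA allocations,
    # otherwise every rank ends up allocating on cuda:0
    torch.cuda.set_device(local_rank)
    if not dist.is_initialized():
        # torch.distributed.launch exports MASTER_ADDR/PORT, RANK, WORLD_SIZE
        dist.init_process_group(backend="nccl")
    model = model.cuda(local_rank)
    # device_ids/output_device keep DDP's gradient buckets on this rank's GPU
    return DDP(model, device_ids=[local_rank], output_device=local_rank)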
Additional Considerations
Hardware-Specific Notes
- NVIDIA GTX 16XX Series: Install driver v531+ for improved VRAM management
- Stable Diffusion Users: Combine with --medvram or --lowvram flags
Prevention Strategies
- Reduce input data dimensions
- Delete unused variables with del, then call gc.collect() to release their memory
- Use AMP (Automatic Mixed Precision):
scaler = torch.cuda.amp.GradScaler()
optimizer.zero_grad()
with torch.cuda.amp.autocast():
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
Troubleshooting Flow
1. Set max_split_size_mb via PYTORCH_CUDA_ALLOC_CONF and rerun
2. Still failing? Delete unused tensors and call torch.cuda.empty_cache()
3. Reduce batch size or input dimensions, or enable AMP
4. For multi-GPU runs, verify each process pins its own device
5. Update GPU drivers (especially for GTX 16XX series cards)
Key Takeaways
- Set max_split_size_mb as a first fix for fragmentation
- Use torch.cuda.empty_cache() for memory-critical operations
- Always pin GPU devices for distributed training workflows
- Update GPU drivers for newer architectures