CUDA Out of Memory: Resolving PyTorch Memory Fragmentation

Problem Statement

When training PyTorch models on CUDA devices, you may encounter this memory allocation error:

RuntimeError: CUDA out of memory. Tried to allocate 8.00 GiB (GPU 0;
15.90 GiB total capacity; 12.04 GiB already allocated; 2.72 GiB free; 
12.27 GiB reserved in total by PyTorch) If reserved memory is >> allocated
memory try setting max_split_size_mb to avoid fragmentation.

This occurs when PyTorch's caching allocator suffers from memory fragmentation: it reserves more memory than your tensors actually occupy because free blocks have been split into pieces too small to satisfy new requests. You can confirm this with the quick check shown after the list below. Common triggers include:

  • Using high-resolution images or complex models
  • Insufficient memory cleanup between training runs
  • Suboptimal memory allocation strategies
  • Multi-GPU training setups
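
To tell fragmentation apart from a genuine capacity limit, compare what live tensors actually use (allocated) with what the caching allocator holds (reserved). A minimal check, assuming a single default GPU:

python
import torch

# A large gap between reserved and allocated memory points to fragmentation
# rather than the model genuinely exceeding GPU capacity.
allocated = torch.cuda.memory_allocated() / 1024**3  # GiB used by live tensors
reserved = torch.cuda.memory_reserved() / 1024**3    # GiB held by the caching allocator
print(f"allocated: {allocated:.2f} GiB, reserved: {reserved:.2f} GiB")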

WARNING

This error often persists even after you reduce the batch size, because the problem is how memory is split rather than how much is used; adjusting max_split_size_mb is therefore crucial for memory-intensive workflows.

Solutions

Setting max_split_size_mb via Environment Variable

The most common fix is to configure PyTorch's caching allocator with max_split_size_mb. Blocks larger than this size are no longer split to serve smaller requests, which reduces the fragmentation that leaves large allocations unsatisfiable.

bash
export PYTORCH_CUDA_ALLOC_CONF="max_split_size_mb:512"
pwsh
$env:PYTORCH_CUDA_ALLOC_CONF="max_split_size_mb:512"
cmd
set PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512

Within Python scripts:

python
import os

# Set before initializing any CUDA operations
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

# Rest of your PyTorch code follows
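
PYTORCH_CUDA_ALLOC_CONF takes a comma-separated list of allocator options, so max_split_size_mb can be combined with other knobs. The garbage_collection_threshold option below is one documented example; treat the exact values as a starting sketch rather than a required setting:

python
import os

# Illustrative combination: cap block splitting at 512 MB and ask the
# allocator to reclaim cached blocks once usage passes 80% of capacity.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = (
    "max_split_size_mb:512,garbage_collection_threshold:0.8"
)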

TIP

Optimal values for max_split_size_mb vary:

  1. Start with 512 (recommended default)
  2. Increase if out-of-memory errors persist (try 1024, 2048)
  3. Decrease (256, 128) if performance degrades significantly
  4. Use PyTorch's diagnostic tools:
    python
    print(torch.cuda.memory_summary())
    print(torch.cuda.memory_stats())

Freeing Reserved Memory with Cache Clear

Ask PyTorch to release unoccupied cached memory back to the driver (tensors that are still referenced are not freed):

python
import torch

# Clear cache before critical operations
torch.cuda.empty_cache()

# Especially useful between training epochs
# or before large memory allocations

WARNING

empty_cache() has computational overhead: subsequent allocations must go back to the CUDA driver instead of reusing cached blocks. Calling it too often can noticeably degrade performance.
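
One way to keep that overhead bounded is to clear the cache once per epoch rather than per batch. A minimal sketch, where num_epochs, model, train_loader, and train_one_epoch are assumed placeholders from your own training code:

python
import torch

for epoch in range(num_epochs):
    train_one_epoch(model, train_loader)  # hypothetical per-epoch training helper
    # Release cached blocks once per epoch instead of on every iteration
    torch.cuda.empty_cache()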

Handling Multi-GPU Memory Issues

A common cause of out-of-memory errors in distributed training is that every process allocates on GPU 0 instead of its own device. When launching with torch.distributed.launch, pin each process to its local rank:

python
import argparse
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int)  # Supplied by torch.distributed.launch
args = parser.parse_args()

# Pin this process to its own GPU before any CUDA allocations (critical fix)
torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend="nccl")

# Initialize your model and data here; .cuda() now targets the pinned device
model = YourModel().cuda()
bash
# The launch command itself is unchanged; the fix lives in train.py.
# Once set_device() is in place, each of the 4 processes uses its own GPU.
python -m torch.distributed.launch --nproc_per_node=4 \
  train.py <your_arguments>
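
Note that recent PyTorch releases deprecate torch.distributed.launch in favor of torchrun, which passes the rank through the LOCAL_RANK environment variable instead of a --local_rank argument. The equivalent pinning then looks like this sketch:

python
import os
import torch

# torchrun exports LOCAL_RANK for every worker process
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)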

Additional Considerations

Hardware-Specific Notes

  • NVIDIA GTX 16XX Series: Install driver v531+ for improved VRAM management
  • Stable Diffusion web UI users: Combine this setting with the --medvram or --lowvram launch flags

Prevention Strategies

  • Reduce input data dimensions
  • Delete unused variables with del and then call gc.collect() (see the cleanup sketch after this list)
  • Use AMP (Automatic Mixed Precision) to reduce activation memory:
    python
    scaler = torch.cuda.amp.GradScaler()

    optimizer.zero_grad()
    with torch.cuda.amp.autocast():      # run the forward pass in mixed precision
        outputs = model(inputs)
        loss = loss_fn(outputs, targets)
    scaler.scale(loss).backward()        # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)
    scaler.update()
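
For the cleanup strategy above, the pattern is to drop all references to a large tensor, let Python's garbage collector run, and then release the cached blocks. A minimal sketch in which activations stands in for any tensor you no longer need:

python
import gc
import torch

del activations           # 'activations' is a hypothetical large tensor
gc.collect()              # drop lingering Python references
torch.cuda.empty_cache()  # return the freed blocks from the cache to the GPU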

Key Takeaways

  1. Set max_split_size_mb as a first fix for fragmentation
  2. Call torch.cuda.empty_cache() sparingly around memory-critical operations
  3. Pin each process to its GPU with torch.cuda.set_device() in distributed training workflows
  4. Update GPU drivers for newer architectures