CUDA Out of Memory: Resolving PyTorch Memory Fragmentation

Problem Statement

When training PyTorch models on CUDA devices, you may encounter this memory allocation error:

RuntimeError: CUDA out of memory. Tried to allocate 8.00 GiB (GPU 0;
15.90 GiB total capacity; 12.04 GiB already allocated; 2.72 GiB free; 
12.27 GiB reserved in total by PyTorch) If reserved memory is >> allocated
memory try setting max_split_size_mb to avoid fragmentation.

This occurs when PyTorch's caching allocator suffers from memory fragmentation: it reserves more memory than your tensors actually occupy because free blocks have been split into pieces too small to satisfy new requests. You can confirm this with the quick check shown after the list below. Common triggers include:

  • Using high-resolution images or complex models
  • Insufficient memory cleanup between training runs
  • Suboptimal memory allocation strategies
  • Multi-GPU training setups
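
To tell fragmentation apart from a genuine capacity limit, compare what live tensors actually use (allocated) with what the caching allocator holds (reserved). A minimal check, assuming a single default GPU:

python
import torch

# A large gap between reserved and allocated memory points to fragmentation
# rather than the model genuinely exceeding GPU capacity.
allocated = torch.cuda.memory_allocated() / 1024**3  # GiB used by live tensors
reserved = torch.cuda.memory_reserved() / 1024**3    # GiB held by the caching allocator
print(f"allocated: {allocated:.2f} GiB, reserved: {reserved:.2f} GiB")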

WARNING

This error often persists even after you reduce the batch size, because the problem is how memory is split rather than how much is used; adjusting max_split_size_mb is therefore crucial for memory-intensive workflows.

Solutions

Setting max_split_size_mb via Environment Variable

The most common fix is to configure PyTorch's caching allocator with max_split_size_mb. Blocks larger than this size are no longer split to serve smaller requests, which reduces the fragmentation that leaves large allocations unsatisfiable.

bash
export PYTORCH_CUDA_ALLOC_CONF="max_split_size_mb:512"
pwsh
$env:PYTORCH_CUDA_ALLOC_CONF="max_split_size_mb:512"
cmd
set PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512

Within Python scripts:

python
import os

# Set before initializing any CUDA operations
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

# Rest of your PyTorch code follows
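
PYTORCH_CUDA_ALLOC_CONF takes a comma-separated list of allocator options, so max_split_size_mb can be combined with other knobs. The garbage_collection_threshold option below is one documented example; treat the exact values as a starting sketch rather than a required setting:

python
import os

# Illustrative combination: cap block splitting at 512 MB and ask the
# allocator to reclaim cached blocks once usage passes 80% of capacity.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = (
    "max_split_size_mb:512,garbage_collection_threshold:0.8"
)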

TIP

Optimal values for max_split_size_mb vary:

  1. Start with 512 (recommended default)
  2. Increase if out-of-memory errors persist (try 1024, 2048)
  3. Decrease (256, 128) if performance degrades significantly
  4. Use PyTorch's diagnostic tools:
    python
    print(torch.cuda.memory_summary())
    print(torch.cuda.memory_stats())

Freeing Reserved Memory with Cache Clear

Ask PyTorch to release unoccupied cached memory back to the driver (tensors that are still referenced are not freed):

python
import torch

# Clear cache before critical operations
torch.cuda.empty_cache()

# Especially useful between training epochs
# or before large memory allocations

WARNING

empty_cache() has computational overhead: subsequent allocations must go back to the CUDA driver instead of reusing cached blocks. Calling it too often can noticeably degrade performance.
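
One way to keep that overhead bounded is to clear the cache once per epoch rather than per batch. A minimal sketch, where num_epochs, model, train_loader, and train_one_epoch are assumed placeholders from your own training code:

python
import torch

for epoch in range(num_epochs):
    train_one_epoch(model, train_loader)  # hypothetical per-epoch training helper
    # Release cached blocks once per epoch instead of on every iteration
    torch.cuda.empty_cache()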

Handling Multi-GPU Memory Issues

A common cause of out-of-memory errors in distributed training is that every process allocates on GPU 0 instead of its own device. When launching with torch.distributed.launch, pin each process to its local rank:

python
import argparse
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int)  # Supplied by torch.distributed.launch
args = parser.parse_args()

# Pin this process to its own GPU before any CUDA allocations (critical fix)
torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend="nccl")

# Initialize your model and data here; .cuda() now targets the pinned device
model = YourModel().cuda()
bash
# The launch command itself is unchanged; the fix lives in train.py.
# Once set_device() is in place, each of the 4 processes uses its own GPU.
python -m torch.distributed.launch --nproc_per_node=4 \
  train.py <your_arguments>
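
Note that recent PyTorch releases deprecate torch.distributed.launch in favor of torchrun, which passes the rank through the LOCAL_RANK environment variable instead of a --local_rank argument. The equivalent pinning then looks like this sketch:

python
import os
import torch

# torchrun exports LOCAL_RANK for every worker process
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)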

Additional Considerations

Hardware-Specific Notes

  • NVIDIA GTX 16XX Series: Install driver v531+ for improved VRAM management
  • Stable Diffusion web UI users: Combine this setting with the --medvram or --lowvram launch flags

Prevention Strategies

  • Reduce input data dimensions
  • Delete unused variables with del and then call gc.collect() (see the cleanup sketch after this list)
  • Use AMP (Automatic Mixed Precision) to reduce activation memory:
    python
    scaler = torch.cuda.amp.GradScaler()

    optimizer.zero_grad()
    with torch.cuda.amp.autocast():      # run the forward pass in mixed precision
        outputs = model(inputs)
        loss = loss_fn(outputs, targets)
    scaler.scale(loss).backward()        # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)
    scaler.update()
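
For the cleanup strategy above, the pattern is to drop all references to a large tensor, let Python's garbage collector run, and then release the cached blocks. A minimal sketch in which activations stands in for any tensor you no longer need:

python
import gc
import torch

del activations           # 'activations' is a hypothetical large tensor
gc.collect()              # drop lingering Python references
torch.cuda.empty_cache()  # return the freed blocks from the cache to the GPU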

Key Takeaways

  1. Set max_split_size_mb as a first fix for fragmentation
  2. Call torch.cuda.empty_cache() sparingly around memory-critical operations
  3. Pin each process to its GPU with torch.cuda.set_device() in distributed training workflows
  4. Update GPU drivers for newer architectures