Optimizing Memory Usage in PyTorch Models

By Nahla Davies on January 30, 2025 in Practical Machine Learning

Even though more than 40% of businesses say they’re pleased with AI, many are unhappy with out-of-the-box solutions, resulting in a need for local AI solutions and their subsequent tweaking with PyTorch.

Consequently, efficient memory management is crucial when working with PyTorch, especially for training and deploying deep learning models on resource-constrained systems. Without proper optimization, large models can quickly exhaust available memory, causing performance bottlenecks or outright crashes.

To combat the lack of optimization, we prepared this guide. It dives into strategies for optimizing memory usage in PyTorch, covering key techniques to maximize efficiency while maintaining model performance.

Use Mixed Precision Training

Mixed precision training leverages both 16-bit and 32-bit floating-point computations to reduce memory consumption and accelerate training. PyTorch’s torch.cuda.amp module makes this straightforward to implement:

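import torch
from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler()  # scales the loss so small FP16 gradients do not underflow
model = MyModel().cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

for inputs, labels in dataloader:
    # Move the batch to the same device as the model
    inputs, labels = inputs.cuda(), labels.cuda()
    optimizer.zero_grad()
    with autocast():  # run the forward pass in mixed precision
        outputs = model(inputs)
        loss = loss_fn(outputs, labels)
    scaler.scale(loss).backward()
    scaler.step(optimizer)  # unscales gradients, skips the step on inf/NaN
    scaler.update()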

This approach not only reduces memory usage but often speeds up training on modern GPUs with tensor cores.
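
To see where the savings come from, the following minimal sketch compares the storage needed by the same tensor in 32-bit and 16-bit precision. It runs on CPU, and the tensor shape and the helper name tensor_bytes are arbitrary choices made for illustration.

```python
import torch

# Arbitrary example tensor; the shape is chosen only for illustration
t32 = torch.zeros(1024, 1024, dtype=torch.float32)
t16 = t32.to(torch.float16)


def tensor_bytes(t: torch.Tensor) -> int:
    """Bytes occupied by the tensor's elements."""
    return t.element_size() * t.nelement()


bytes32 = tensor_bytes(t32)
bytes16 = tensor_bytes(t16)
print(f"float32: {bytes32 / 2**20:.1f} MiB, float16: {bytes16 / 2**20:.1f} MiB")
```

Under autocast the model parameters themselves stay in 32-bit, so in practice the savings come mostly from activations computed in 16-bit rather than from the weights.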

Leverage Gradient Checkpointing

Gradient checkpointing trades extra computation for memory savings. Instead of keeping every intermediate activation for the backward pass, it stores only a subset and recomputes the rest when gradients are needed, which can significantly reduce peak memory usage.

Enable gradient checkpointing in PyTorch using the torch.utils.checkpoint module:

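import torch
from torch.utils.checkpoint import checkpoint

class CheckpointedModel(torch.nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, x):
        # Activations inside self.model are recomputed during backward
        # instead of being stored; use_reentrant=False is the variant
        # recommended by the PyTorch documentation
        return checkpoint(self.model, x, use_reentrant=False)

model = CheckpointedModel(MyModel())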

This technique is particularly effective for deep models with numerous layers.
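
For deep models it is common to checkpoint block by block rather than wrapping the whole network in one call. The sketch below assumes a toy stack of four linear blocks (the sizes and function names are hypothetical) and checks that the checkpointed forward pass yields the same input gradients as the plain one.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)

# Hypothetical toy model: four small blocks stand in for a deep network
blocks = nn.ModuleList(
    [nn.Sequential(nn.Linear(32, 32), nn.ReLU()) for _ in range(4)]
)


def forward_plain(x):
    for block in blocks:
        x = block(x)
    return x


def forward_checkpointed(x):
    # Each block's internal activations are dropped after the forward pass
    # and recomputed when the backward pass reaches that block
    for block in blocks:
        x = checkpoint(block, x, use_reentrant=False)
    return x


x = torch.randn(8, 32, requires_grad=True)

forward_plain(x).sum().backward()
grad_plain = x.grad.clone()

x.grad = None
blocks.zero_grad()
forward_checkpointed(x).sum().backward()
grad_checkpointed = x.grad.clone()

print(torch.allclose(grad_plain, grad_checkpointed))
```

The results are numerically the same; what changes is that fewer activations are held in memory between the forward and backward passes, at the cost of a second forward computation per block.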

Optimize Data Loading with torch.utils.data

Memory inefficiencies often arise during data loading. To minimize these issues, use the DataLoader class with the following optimizations:

Use pin_memory for GPU training: Ensures faster data transfer between host and GPU memory
Set prefetch_factor and num_workers: Adjust these parameters to enable parallel data loading, reducing idle GPU time

Example:

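from torch.utils.data import DataLoader

data_loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=4,       # worker processes that load batches in parallel
    prefetch_factor=2,   # batches each worker prepares in advance
    pin_memory=True      # page-locked host memory for faster GPU transfer
)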
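
Pinned memory pays off most when the host-to-GPU copy is issued with non_blocking=True. The sketch below is a minimal, self-contained illustration under a few assumptions: the random TensorDataset is made up, num_workers is set to 0 so it runs anywhere, and pinning is enabled only when a GPU is present.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical dataset: 256 random samples with 16 features and binary labels
dataset = TensorDataset(torch.randn(256, 16), torch.randint(0, 2, (256,)))

loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=0,                         # raise this for real workloads
    pin_memory=torch.cuda.is_available(),  # pinning only helps with a GPU
)

batch_shapes = []
for inputs, labels in loader:
    # With pinned memory, non_blocking=True lets the copy overlap with compute
    inputs = inputs.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    batch_shapes.append(tuple(inputs.shape))

print(batch_shapes)
```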

Employ Model Quantization

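Quantization reduces the numerical precision of model parameters and computations, for example from 32-bit floats to 8-bit integers, decreasing memory usage, usually with only a small loss in accuracy. PyTorch supports both static quantization (weights and activations are quantized ahead of time, using calibration data) and dynamic quantization (weights are quantized ahead of time, while activations are quantized on the fly during inference).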

Here’s an example of dynamic quantization:

import torch
from torch.quantization import quantize_dynamic

model = MyModel()
# Replace the Linear layers with versions that store int8 weights
quantized_model = quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

This approach is ideal for deployment scenarios where memory constraints are critical.
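
One way to check the effect is to compare the serialized size of the model before and after quantization. The sketch below assumes a small made-up model and a helper named state_dict_bytes (both hypothetical), and it assumes a CPU build of PyTorch with a quantized backend available.

```python
import io

import torch
from torch import nn


def state_dict_bytes(model: nn.Module) -> int:
    """Size of the serialized state dict, measured in an in-memory buffer."""
    buffer = io.BytesIO()
    torch.save(model.state_dict(), buffer)
    return buffer.getbuffer().nbytes


# Hypothetical model: two Linear layers dominate the parameter count
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).eval()

quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

fp32_bytes = state_dict_bytes(model)
int8_bytes = state_dict_bytes(quantized)
output_shape = tuple(quantized(torch.randn(1, 512)).shape)

print(f"fp32: {fp32_bytes} bytes, int8: {int8_bytes} bytes")
```

Accuracy should still be validated on held-out data after quantizing, since the size reduction says nothing about prediction quality.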

Reduce Redundant Variables

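Excessive use of temporary variables can bloat memory usage. Use in-place operations where autograd does not need the original values, and delete variables explicitly when they’re no longer needed:

# In-place operation: modifies x directly, no new tensor is allocated
x.add_(y)

# Drop the reference so the memory can be reused
del y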

Monitor your memory usage during training using torch.cuda.memory_allocated() and torch.cuda.memory_reserved() to identify potential leaks.
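
As a rough illustration, the sketch below wraps the two monitoring calls in a small helper (cuda_memory_mib is a hypothetical name) that returns zeros when no GPU is present, and confirms that an in-place addition reuses the existing storage of x.

```python
import torch


def cuda_memory_mib() -> dict:
    """Allocated and reserved CUDA memory in MiB (zeros without a GPU)."""
    if not torch.cuda.is_available():
        return {"allocated": 0.0, "reserved": 0.0}
    return {
        "allocated": torch.cuda.memory_allocated() / 2**20,
        "reserved": torch.cuda.memory_reserved() / 2**20,
    }


device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.ones(1024, 1024, device=device)
y = torch.ones(1024, 1024, device=device)

ptr_before = x.data_ptr()
x.add_(y)  # in-place: the result is written into x's existing storage
same_storage = x.data_ptr() == ptr_before

del y  # drop the last reference so the allocator can reuse the block
stats = cuda_memory_mib()
print(same_storage, stats)
```

Calling the helper at the same point in each training iteration makes steady growth in the allocated figure easy to spot.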

Adjust Batch Sizes Dynamically

The batch size directly affects memory usage. For memory-constrained environments, employ a dynamic batch-sizing strategy:

Start with a large batch size
Gradually reduce it if a CUDA out of memory error occurs

Here’s an example of batch size adjustment in action:

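import torch

batch_size = 64

while batch_size >= 1:
    try:
        train(model, dataloader, batch_size)
        break
    except RuntimeError as e:
        if "out of memory" not in str(e):
            raise  # unrelated error: re-raise with the original traceback
        # Release cached blocks before retrying with a smaller batch
        torch.cuda.empty_cache()
        batch_size //= 2
else:
    raise RuntimeError("Out of memory even with batch_size=1")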
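
The retry logic can be exercised without a GPU by simulating the failure. In the sketch below, find_workable_batch_size and fake_train are hypothetical names, and fake_train merely pretends to run out of memory above a batch size of 16.

```python
def find_workable_batch_size(train_fn, start: int, min_size: int = 1) -> int:
    """Halve the batch size until train_fn runs without an out-of-memory error."""
    batch_size = start
    while batch_size >= min_size:
        try:
            train_fn(batch_size)
            return batch_size
        except RuntimeError as e:
            if "out of memory" not in str(e):
                raise  # unrelated errors are not swallowed
            batch_size //= 2
    raise RuntimeError(f"Out of memory even at batch size {min_size}")


def fake_train(batch_size: int) -> None:
    # Hypothetical stand-in for a training step: fails above 16 samples
    if batch_size > 16:
        raise RuntimeError("CUDA out of memory (simulated)")


chosen = find_workable_batch_size(fake_train, start=64)
print(chosen)
```

If the effective batch size matters for convergence, the halved batch can be combined with gradient accumulation so that the optimizer still sees the original number of samples per step.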

Prune Unnecessary Parameters

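Pruning removes redundant weights from the model. PyTorch provides the torch.nn.utils.prune module for structured and unstructured pruning. Note that this module zeroes weights through a mask rather than shrinking the tensor, so memory savings only materialize once the pruned weights are stored or executed in a sparse or compacted form: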

import torch.nn.utils.prune as prune
prune.l1_unstructured(model.layer, name='weight', amount=0.4)

After pruning, fine-tune the model to recover performance.
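
The following sketch, which assumes a standalone 10x10 linear layer chosen only for illustration, shows what the pruning call does under the hood: it registers the original weights and a mask, and prune.remove later makes the zeros permanent.

```python
import torch
from torch import nn
import torch.nn.utils.prune as prune

torch.manual_seed(0)
layer = nn.Linear(10, 10)  # hypothetical layer with 100 weights

# Zero out the 40% of weights with the smallest absolute value
prune.l1_unstructured(layer, name="weight", amount=0.4)

# While pruning is active, the layer holds the original weights plus a mask
param_names = [name for name, _ in layer.named_parameters()]
buffer_names = [name for name, _ in layer.named_buffers()]
sparsity = (layer.weight == 0).float().mean().item()

# Make the pruning permanent: drops the mask and the original copy
prune.remove(layer, "weight")
param_names_after = [name for name, _ in layer.named_parameters()]

print(param_names, buffer_names, sparsity, param_names_after)
```

Until prune.remove is called, the layer holds both weight_orig and weight_mask, so memory use temporarily goes up rather than down.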

Use Distributed Training with Memory Splitting

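When training large models, distribute computations and memory across multiple GPUs. PyTorch’s torch.nn.DataParallel and torch.nn.parallel.DistributedDataParallel split each batch across GPUs, which lowers the activation memory needed per device. Note that both keep a full copy of the model on every GPU, and the PyTorch documentation recommends DistributedDataParallel over DataParallel: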

model = torch.nn.DataParallel(model)

Alternatively, consider tensor parallelism for memory-intensive models, splitting layers across GPUs to reduce memory bottlenecks.
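
The simplest way to spread a model's parameters over several devices is to place whole layers on different GPUs. The sketch below shows this naive layer-wise model parallelism, which is simpler than true tensor parallelism (where individual weight matrices are sharded). The class TwoStageModel and its layer sizes are hypothetical, and both stages fall back to the CPU when fewer than two GPUs are available.

```python
import torch
from torch import nn

multi_gpu = torch.cuda.device_count() >= 2
dev0 = torch.device("cuda:0" if multi_gpu else "cpu")
dev1 = torch.device("cuda:1" if multi_gpu else "cpu")


class TwoStageModel(nn.Module):
    """Each stage's parameters live on a different device."""

    def __init__(self):
        super().__init__()
        self.stage0 = nn.Sequential(nn.Linear(64, 128), nn.ReLU()).to(dev0)
        self.stage1 = nn.Linear(128, 10).to(dev1)

    def forward(self, x):
        x = self.stage0(x.to(dev0))
        # Only the activations cross the device boundary, not the weights
        return self.stage1(x.to(dev1))


model = TwoStageModel()
output = model(torch.randn(8, 64))
print(output.shape, output.device)
```

With this layout each GPU holds only part of the weights, but the stages run one after another, so one device is idle while the other works unless the batch is pipelined.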

Monitor and Profile Memory Usage

Effective memory optimization begins with understanding your model’s memory usage. PyTorch’s built-in torch.utils.bottleneck and torch.profiler (PyTorch Profiler), along with external tools like nvidia-smi, provide detailed insights.

Example using torch.profiler:

import torch.profiler as profiler

with profiler.profile(
    profile_memory=True,  # record tensor memory allocations per operator
    on_trace_ready=profiler.tensorboard_trace_handler('./log'),
) as prof:
    model(inputs)

print(prof.key_averages().table(row_limit=10))

Profiling tools like these allow you to pinpoint memory bottlenecks, enabling targeted optimizations. For instance, you may discover that certain layers or operations disproportionately consume memory, guiding your efforts to optimize or replace them.
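
To rank operators by how much memory they allocate, the profiler can be run with profile_memory=True and its table sorted by a memory column. The sketch below is a CPU-only illustration on a made-up model; on a GPU you would also pass ProfilerActivity.CUDA in the activities list.

```python
import torch
from torch import nn
from torch.profiler import ProfilerActivity, profile

# Hypothetical model and input, sized only for illustration
model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 10))
inputs = torch.randn(32, 256)

with profile(activities=[ProfilerActivity.CPU], profile_memory=True) as prof:
    with torch.no_grad():
        model(inputs)

# Operators that allocated the most memory themselves appear first
report = prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=5)
print(report)
```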

Adopt Efficient Deployment Strategies

For deployment, optimize model serialization with TorchScript or ONNX export. These formats package the model without its Python training code, which can reduce runtime overhead while maintaining compatibility:

traced_model = torch.jit.trace(model, example_input)
torch.jit.save(traced_model, "optimized_model.pt")

Utilize ONNX for compatibility with other frameworks or inference engines, enabling broader deployment options. Such optimizations not only save memory but also improve inference speed, making them critical for production environments.
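
Before shipping an exported model, it is worth confirming that it reproduces the original outputs. The sketch below traces a small made-up model with TorchScript, round-trips it through an in-memory buffer instead of a file, and compares predictions.

```python
import io

import torch
from torch import nn

# Hypothetical model and example input
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4)).eval()
example_input = torch.randn(1, 16)

traced = torch.jit.trace(model, example_input)

# Save to and load from an in-memory buffer; a file path works the same way
buffer = io.BytesIO()
torch.jit.save(traced, buffer)
buffer.seek(0)
restored = torch.jit.load(buffer)

with torch.no_grad():
    outputs_match = torch.allclose(model(example_input), restored(example_input))

print(outputs_match)
```

Tracing records only the operations executed for the example input, so models with data-dependent control flow need extra care before this check can be trusted.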

Additionally, consider implementing weight sharing or tensor compression techniques for further memory savings during deployment. Compression libraries like LiteRT (formerly known as TensorFlow Lite) or custom PyTorch compression scripts can be helpful in reducing the overall model size without significant performance trade-offs.

For large-scale production systems, leveraging cloud-based GPU server hosting solutions ensures scalability and consistent performance without being constrained by local hardware limitations. This facilitates deploying multiple models simultaneously and reduces operational overhead.

Tips for More Efficient Memory Optimization

Use Sparse Matrices

If your model or dataset contains a large number of zero entries, using sparse matrices can reduce memory usage significantly. PyTorch provides the torch.sparse module for operations on sparse tensors:

sparse_tensor = torch.sparse_coo_tensor(indices, values, size)

Sparse tensors are particularly useful in scenarios like natural language processing, where representations such as bag-of-words features or one-hot encodings consist mostly of zeros.
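
The size difference is easy to quantify. The sketch below builds a 1000x1000 matrix with three non-zero entries (values and positions are made up) and compares the bytes held by the dense tensor with the bytes held by the sparse tensor's indices and values.

```python
import torch

# Three non-zero entries in a 1000 x 1000 matrix (made-up positions and values)
indices = torch.tensor([[0, 10, 999],   # row indices
                        [5, 20, 999]])  # column indices
values = torch.tensor([1.0, 2.0, 3.0])
size = (1000, 1000)

sparse = torch.sparse_coo_tensor(indices, values, size).coalesce()
dense = sparse.to_dense()


def nbytes(t: torch.Tensor) -> int:
    return t.element_size() * t.nelement()


dense_bytes = nbytes(dense)
# COO storage keeps one int64 index pair and one value per non-zero entry
sparse_bytes = nbytes(sparse.indices()) + nbytes(sparse.values())

print(dense_bytes, sparse_bytes)
```

The COO format stores two 64-bit indices per non-zero value in this 2-D case, so it only pays off when the tensor is mostly zeros.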

Apply Knowledge Distillation

Knowledge distillation involves training a smaller, more memory-efficient model to mimic the performance of a larger model. This approach can significantly reduce memory usage while maintaining similar levels of accuracy.

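import torch

teacher_model = MyLargeModel()
student_model = MySmallModel()
teacher_model.eval()  # the teacher is frozen; only the student is trained
optimizer = torch.optim.Adam(student_model.parameters(), lr=0.001)

# Train student model to mimic teacher's predictions
for inputs in dataloader:
    with torch.no_grad():  # no activations are stored for the teacher
        teacher_outputs = teacher_model(inputs)
    student_outputs = student_model(inputs)
    loss = loss_fn(student_outputs, teacher_outputs)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()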
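
The loop above leaves loss_fn unspecified. One common choice, shown here as an assumption and not as the loss this article prescribes, is the KL divergence between temperature-softened teacher and student distributions. The function name distillation_loss and the temperature of 2.0 are illustrative.

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """KL divergence between softened teacher and student distributions."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # The temperature**2 factor keeps gradient magnitudes comparable
    # across different temperature settings
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature**2


torch.manual_seed(0)
teacher_logits = torch.randn(4, 10)                      # made-up teacher outputs
student_logits = torch.randn(4, 10, requires_grad=True)  # made-up student outputs

loss = distillation_loss(student_logits, teacher_logits)
loss.backward()

# A student that matches the teacher exactly incurs (numerically) zero loss
matched_loss = distillation_loss(teacher_logits, teacher_logits)
print(loss.item(), matched_loss.item())
```

In practice this term is often combined with the ordinary cross-entropy on the true labels, weighted by a tunable coefficient.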

Offload Layers Dynamically

When working with extremely large models, consider offloading certain layers or computations to CPU memory or even disk storage. Libraries like Hugging Face’s Accelerate make this process seamless by managing the offloading dynamically based on memory constraints.
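
To make the idea concrete, here is a hand-rolled, inference-only sketch of layer offloading in plain PyTorch. It is not how Accelerate implements offloading; the class OffloadedSequential is hypothetical, and on a machine without a GPU the "compute device" is simply the CPU.

```python
import torch
from torch import nn

compute_device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


class OffloadedSequential(nn.Module):
    """Keeps layers on the CPU and moves each one over only while it runs."""

    def __init__(self, layers):
        super().__init__()
        self.layers = nn.ModuleList(layers)  # parameters start on the CPU

    def forward(self, x):
        x = x.to(compute_device)
        for layer in self.layers:
            layer.to(compute_device)  # bring this layer's weights in
            x = layer(x)
            layer.to("cpu")           # release device memory again
        return x


torch.manual_seed(0)
model = OffloadedSequential([nn.Linear(32, 32) for _ in range(3)])

with torch.no_grad():  # inference only; this sketch is not meant for training
    output = model(torch.randn(4, 32))

resident_devices = {p.device.type for p in model.parameters()}
print(output.shape, resident_devices)
```

Only one layer's weights occupy device memory at a time, at the cost of a host-to-device transfer per layer on every forward pass.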

Conclusion

Memory optimization in PyTorch is a multi-faceted process involving adjustments at the training, deployment, and infrastructure levels. Techniques like mixed precision training, gradient checkpointing, quantization, and pruning significantly reduce memory consumption.

Complement these strategies with effective profiling tools, sparse matrix operations, and knowledge distillation to ensure efficient resource utilization without sacrificing performance. By leveraging cloud resources and advanced deployment strategies, developers can maximize PyTorch’s capabilities even in the most resource-constrained environments.
