Train Your Large Model on Multiple GPUs with Pipeline Parallelism

Some language models are too large to train on a single GPU. When the model fits on a single GPU but cannot be trained with a large batch size, you can use data parallelism. However, when the model is too large to fit on a single GPU, you need to split it across multiple GPUs. In this article, you will learn how to use pipeline parallelism to split models for training. In particular, you will learn about:

  • What is pipeline parallelism
  • How to use pipeline parallelism in PyTorch
  • How to save and restore the model with pipeline parallelism

Let’s get started!

Photo by Ivan Ivankovic. Some rights reserved.

Overview

This article is divided into six parts; they are:

  • Pipeline Parallelism Overview
  • Model Preparation for Pipeline Parallelism
  • Stage and Pipeline Schedule
  • Training Loop
  • Distributed Checkpointing
  • Limitations of Pipeline Parallelism

Pipeline Parallelism Overview

Pipeline parallelism means creating the model as a pipeline of stages. If you have worked on a scikit-learn project, you may be familiar with the concept of a pipeline. An example of a scikit-learn pipeline is:
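A minimal sketch of such a pipeline, assuming scikit-learn is installed. The toy data and the step names "scaler" and "classifier" are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data, made up for illustration: 4 samples, 2 features, binary labels
X = np.array([[0.0, 1.0], [1.0, 3.0], [2.0, 5.0], [3.0, 7.0]])
y = np.array([0, 0, 1, 1])

# Two stages chained together: the scaler's output feeds the classifier
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])
pipe.fit(X, y)
predictions = pipe.predict(X)
```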

When you pass data to this pipeline, it is processed by the first stage (StandardScaler), and the output is passed to the second stage (LogisticRegression).

A transformer model is typically just a stack of transformer blocks. Each block takes one tensor as input and produces one tensor as output. This makes it a perfect candidate for a pipeline: each stage is a transformer block, and the blocks are chained together. Executing the pipeline is mathematically equivalent to executing the model.

With a transformer model, it is straightforward to create a pipeline manually. At a high level, all you need to do is the following:
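As a rough sketch, the manual approach could look like the code below. The three nn.Linear modules are hypothetical stand-ins for groups of transformer blocks, and the sketch falls back to the CPU when fewer than three GPUs are available:

```python
import torch
import torch.nn as nn

# Use three GPUs when available; otherwise fall back to CPU so the sketch still runs
if torch.cuda.device_count() >= 3:
    devices = [torch.device(f"cuda:{i}") for i in range(3)]
else:
    devices = [torch.device("cpu")] * 3

# Hypothetical stand-ins for three groups of transformer blocks
stage1 = nn.Linear(16, 16).to(devices[0])
stage2 = nn.Linear(16, 16).to(devices[1])
stage3 = nn.Linear(16, 16).to(devices[2])

x = torch.randn(8, 16)
# Each stage runs on its own device; its output is copied to the next device
output1 = stage1(x.to(devices[0]))
output2 = stage2(output1.to(devices[1]))
output3 = stage3(output2.to(devices[2]))
```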

However, this method is not efficient. While the stage1 model runs on GPU 0, GPUs 1 and 2 are idle. Only after stage1 finishes and the tensor output1 is ready can stage2 start on GPU 1, and so on.

In PyTorch, there is infrastructure for managing the pipeline to keep all GPUs busy. This is based on the concept of micro-batches: instead of processing a batch of size $N$, you split the batch into $n$ micro-batches of size $N/n$ each. When stage2 processes the $i$-th micro-batch, stage1 can process the $(i+1)$-th micro-batch. Once all micro-batches are processed, the results are aggregated to produce the final output.
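To make the overlap concrete, here is a small pure-Python simulation of the forward passes only. It assumes every stage takes exactly one time step per micro-batch, which is a simplification, and the function name forward_timeline() is made up for this illustration:

```python
from typing import List, Optional


def forward_timeline(n_stages: int, n_microbatches: int) -> List[List[Optional[int]]]:
    """Return timeline[t][s]: the micro-batch that stage s works on at step t, or None if idle."""
    timeline = []
    for t in range(n_stages + n_microbatches - 1):
        row = []
        for s in range(n_stages):
            mb = t - s  # stage s can start micro-batch i only after the earlier stages finished it
            row.append(mb if 0 <= mb < n_microbatches else None)
        timeline.append(row)
    return timeline


for t, row in enumerate(forward_timeline(n_stages=3, n_microbatches=4)):
    print(f"step {t}: {row}")
```

Each printed row shows what the three stages are doing at one time step. The None entries at the beginning and the end are the periods in which a stage has nothing to do.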

Let’s see how you can implement a training script for pipeline parallelism in PyTorch.

Warning: The PyTorch pipeline parallelism API is experimental and may change in the future. The code in this article was tested on PyTorch 2.9.1 and may not work on other PyTorch versions.

Model Preparation for Pipeline Parallelism

If your model can fit on a single GPU, distributed data parallel is preferable. When you need pipeline parallelism, your model is likely too large to fit on a single device.

Before you set up the pipeline, you need to create your model first. You have two options: either create the model for one stage so it fits on your GPU, or create the full model on a fake device and then trim it before transferring it to an actual GPU. The former requires defining your model with a stage argument in its constructor so that a particular stage can be created. For the latter, you can do the following:
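A minimal sketch of the second option is shown below. It uses a hypothetical ToyDecoderModel as a stand-in for LlamaForPretraining; the attribute names embed_tokens, layers, norm, and lm_head, as well as the helper trim_for_stage(), are assumptions made for this illustration. The rank is hard-coded here, whereas a real script would read it from the distributed environment:

```python
import torch
import torch.nn as nn


class ToyDecoderModel(nn.Module):
    """Hypothetical stand-in for a decoder-only language model."""

    def __init__(self, vocab_size: int = 100, dim: int = 32, n_layers: int = 6):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, dim)
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_layers)])
        self.norm = nn.LayerNorm(dim)
        self.lm_head = nn.Linear(dim, vocab_size)


def trim_for_stage(model: nn.Module, rank: int, world_size: int) -> nn.Module:
    """Keep only the components that belong to the stage `rank`; set the rest to None."""
    n_layers = len(model.layers)
    per_stage = n_layers // world_size  # assumes n_layers is divisible by world_size
    first, last = rank * per_stage, (rank + 1) * per_stage
    for i in range(n_layers):
        if not (first <= i < last):
            model.layers[i] = None
    if rank != 0:
        model.embed_tokens = None  # only the first stage embeds the tokens
    if rank != world_size - 1:
        model.norm = None  # only the last stage normalizes and predicts
        model.lm_head = None
    return model


# Build the full model on the meta device: shapes are recorded, no memory is allocated
with torch.device("meta"):
    model = ToyDecoderModel()

model = trim_for_stage(model, rank=1, world_size=3)
```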

The model is created using the class LlamaForPretraining defined in the previous post. If the model is too large, instantiating it would cause an out-of-memory error. Here, you create the model on a fake device meta. When a model is created on meta, the weights are not allocated.

In the code above, you partition the model into three stages: at rank 0 (the first stage), the model keeps the embedding layer and the first 1/3 of the decoder layers. At rank 1 (the second stage), the model keeps only the middle 1/3 of the decoder layers. At rank 2 (the third stage), the model keeps the last 1/3 of the decoder layers, the final normalization layer, and the prediction head. Components not needed in a particular stage are set to None. The stages do not overlap, and together they cover the entire model.

To make such a model work, you need to modify the model code so that when a component is None, it is skipped in the forward pass. This needs to be done in the classes LlamaModel and LlamaForPretraining:
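The classes LlamaModel and LlamaForPretraining are not reproduced here. As an illustration of the idea only, the hypothetical ToyDecoderModel below shows a forward pass that tolerates missing components. Its attribute names are assumptions and need not match the real classes:

```python
import torch
import torch.nn as nn


class ToyDecoderModel(nn.Module):
    """Hypothetical stand-in whose forward pass skips components that are None."""

    def __init__(self, vocab_size: int = 100, dim: int = 32, n_layers: int = 6):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, dim)
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_layers)])
        self.norm = nn.LayerNorm(dim)
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # On the first stage x holds token IDs; on later stages it holds hidden states
        hidden_states = self.embed_tokens(x) if self.embed_tokens is not None else x
        for layer in self.layers:
            if layer is not None:
                hidden_states = layer(hidden_states)
        if self.norm is not None:
            hidden_states = self.norm(hidden_states)
        if self.lm_head is not None:
            hidden_states = self.lm_head(hidden_states)
        return hidden_states


# Emulate a middle stage that keeps only layers 2 and 3
middle = ToyDecoderModel()
middle.embed_tokens = None
middle.norm = None
middle.lm_head = None
for i in (0, 1, 4, 5):
    middle.layers[i] = None

hidden_out = middle(torch.randn(2, 8, 32))  # hidden states in, hidden states out
logits = ToyDecoderModel()(torch.randint(0, 100, (2, 8)))  # full model: token IDs in, logits out
```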

You can see that several if-statements are added to check if the component is None before allowing it to process the hidden_states tensor.

After you create the partial model, you need to transfer it to the actual GPU. Transferring a model from the meta device to a real GPU device is done using the method to_empty(), not to(), as you need to allocate the weight tensors during the transfer:
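A sketch of this step, under the assumption that reset_all_weights() simply walks over the submodules and calls reset_parameters() where it exists. The small nn.Sequential model is a placeholder, and the device falls back to the CPU when no GPU is present:

```python
import torch
import torch.nn as nn


def reset_all_weights(model: nn.Module) -> None:
    """Re-initialize every submodule that defines reset_parameters()."""

    @torch.no_grad()
    def _reset(module: nn.Module) -> None:
        reset = getattr(module, "reset_parameters", None)
        if callable(reset):
            reset()

    model.apply(_reset)  # apply() visits all submodules and skips entries that are None


# Placeholder model created on the meta device, so no weights are allocated yet
with torch.device("meta"):
    model = nn.Sequential(nn.Linear(8, 8), nn.LayerNorm(8))

# Under torchrun you would typically use the GPU that matches the local rank
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to_empty(device=device)  # allocates uninitialized storage on the target device
reset_all_weights(model)  # fills the storage with properly initialized values
```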

The function reset_all_weights() calls the reset_parameters() method on all model components. This initializes the weights correctly, such as setting the weights to uniformly distributed random values in nn.Linear modules, to normally distributed random values in nn.Embedding modules, or to all ones in nn.RMSNorm modules.

Stage and Pipeline Schedule

In PyTorch, pipeline parallelism should be executed with the torchrun command rather than running the script directly. This means multiple processes will be launched, each handling a stage of the pipeline.

When you write a script for torchrun, remember that multiple processes will execute the same script, and each process should operate only on its own scope of work. In pipeline parallelism, this means:

  1. The script should create only one stage of the model
  2. The script should set up a pipeline to allow communication between stages

The key is to use the process group in the torch.distributed module. When torchrun launches multiple processes, the total number of processes is called the world size, and each process has a unique rank from 0 to the world size minus one. If you run these processes across multiple computers on a network, each process is assigned a particular GPU device on its machine. The local rank is the index of the process within its machine and is typically used as the device ID.

As with distributed data parallel, you should initialize the distributed environment before you set up the pipeline:
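A minimal sketch of the initialization, relying on the RANK, LOCAL_RANK, and WORLD_SIZE environment variables that torchrun sets for each process. The helper names read_torchrun_env() and init_distributed() are made up here, and the fallback to the gloo backend on CPU is an assumption to keep the sketch usable without GPUs:

```python
import os

import torch
import torch.distributed as dist


def read_torchrun_env(env=os.environ):
    """Read the rank information that torchrun exports as environment variables."""
    rank = int(env.get("RANK", 0))
    local_rank = int(env.get("LOCAL_RANK", 0))
    world_size = int(env.get("WORLD_SIZE", 1))
    return rank, local_rank, world_size


def init_distributed():
    """Initialize the default process group and pick the device for this process."""
    rank, local_rank, world_size = read_torchrun_env()
    if torch.cuda.is_available():
        device = torch.device(f"cuda:{local_rank}")
        torch.cuda.set_device(device)
        backend = "nccl"
    else:
        device = torch.device("cpu")
        backend = "gloo"
    # Uses the MASTER_ADDR and MASTER_PORT variables that torchrun also provides
    dist.init_process_group(backend=backend)
    return rank, world_size, device
```

The function init_distributed() is meant to be called once at the start of a script launched by torchrun. At the end of the script, dist.destroy_process_group() cleans up.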

Then, you can create the stage object. It specifies which stage your model belongs to, which device it should run on, and how many stages there are in total:

Now that you have set up the model pipeline, you still need to specify how the data is processed into micro-batches within it. PyTorch offers multiple algorithms to utilize the pipeline, called schedules. A simple one to start with is ScheduleGPipe:

As mentioned above, the transformer model you used is a stack of transformer blocks, each of which takes one tensor as input and produces one tensor as output. In pipeline parallelism, you do not explicitly run the model’s forward and backward passes; instead, you use the pipeline schedule to coordinate the stages.

Recall that the backward pass uses the output from the forward pass to compute the loss metric, then propagates the gradient back to the model parameters based on the loss. For the pipeline schedule to know how to trigger the backward pass, you need to implement a loss function, such as loss_fn() above.

The n_microbatches argument specifies how to split the batch into micro-batches. When you use pipeline parallelism, PyTorch expects a batched tensor as input to the pipeline schedule, which is then split and fed into the pipeline stages sequentially.
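Conceptually, the split is along the batch dimension. The snippet below mimics it with torch.tensor_split() on a toy tensor. This is only an illustration of the idea, since the schedule performs the splitting internally:

```python
import torch

batch = torch.arange(8 * 4).reshape(8, 4)  # toy batch: 8 sequences of 4 token IDs each
n_microbatches = 4

# Split along dimension 0 (the batch dimension) into equally sized micro-batches
microbatches = torch.tensor_split(batch, n_microbatches, dim=0)
for i, mb in enumerate(microbatches):
    print(i, tuple(mb.shape))
```

In practice, choosing a batch size that is divisible by n_microbatches keeps all micro-batches the same size.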

Micro-batches are key to keeping all GPUs busy, as each stage can process a different micro-batch in parallel. Once all micro-batches are processed, you aggregate the results to get the final output and perform gradient updates. This completes one training step; you then proceed to the next batch.

Not all GPUs are busy at all times. The number of idle GPUs and the duration of idle time are collectively referred to as the bubble. Pipeline scheduling algorithms vary in how they minimize bubble formation, which is critical to the efficiency of pipeline parallelism.
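For a rough feel of the bubble size, consider a GPipe-style schedule under the simplifying assumption that every stage takes the same time per micro-batch. With $p$ stages and $m$ micro-batches, each device is busy for $m$ out of $m + p - 1$ time slots, so the idle fraction is $(p-1)/(m+p-1)$. The function name below is made up for this illustration:

```python
def bubble_fraction(n_stages: int, n_microbatches: int) -> float:
    """Idle fraction of each device in a GPipe-style schedule with equal stage times."""
    # Each device works during n_microbatches slots out of n_microbatches + n_stages - 1
    return (n_stages - 1) / (n_microbatches + n_stages - 1)


for m in (1, 4, 16, 64):
    print(f"3 stages, {m} micro-batches: {bubble_fraction(3, m):.2%} idle")
```

Under these assumptions, increasing the number of micro-batches shrinks the bubble, at the cost of smaller and possibly less efficient micro-batches.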

Bubbles in pipeline parallelism: The numbered boxes represent micro-batches processed by the devices; typically, the backward pass takes at least twice as long as the forward pass. The grey area means the devices are idle. The illustration is from Fig. 3 of Narayanan et al. (2021).

Training Loop

Once you have instantiated the partial model, created the pipeline stage object, and configured the schedule, the data loader, optimizer, and learning rate scheduler are the same as in single-GPU training.

However, in the training loop, you should use the pipeline schedule for the forward and backward passes. You should not call the model or compute the loss metric directly. Moreover, each stage of the pipeline works differently in the training loop. Below is how you should modify the training loop for pipeline parallelism:
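A sketch of the per-batch logic, assuming schedule is a pipeline schedule object with a step() method as provided by torch.distributed.pipelining. The helper pipeline_step() is hypothetical. It passes the input only at the first stage and the target only at the last stage:

```python
def pipeline_step(schedule, rank, world_size, input_ids=None, target_ids=None):
    """Run the forward and backward passes of one batch through the pipeline schedule.

    Returns the list of per-micro-batch losses on the last stage and an empty
    list on all other stages.
    """
    losses = []
    # Only the first stage feeds the input batch into the pipeline
    args = (input_ids,) if rank == 0 else ()
    # Only the last stage provides the targets and collects the losses
    kwargs = {"target": target_ids, "losses": losses} if rank == world_size - 1 else {}
    schedule.step(*args, **kwargs)
    return losses
```

Every process calls pipeline_step() once per batch. The intermediate stages call schedule.step() with no arguments and receive their input from the previous stage.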

You create the model object but never call it directly in the training loop. Instead, at rank 0, you pass the input tensor input_ids to the pipeline schedule. This is how you send the input to the first stage of the pipeline. At the intermediate stages, call schedule.step() without arguments to have the pipeline process the output from the previous stage. At the final stage, the model produces its output, and you provide the target tensor target_ids so that the schedule calls the loss function to compute the loss and trigger the backward pass. The loss is not used explicitly in the training loop, as the pipeline schedule handles it internally. However, you can pass a Python list as the losses argument to collect the loss of each micro-batch.

After the model completes its forward and backward passes, the gradient is computed and stored with the model. You can then perform the usual gradient update processes, including gradient clipping, optimizer step, and learning rate scheduler update.
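A sketch of the update step with a toy model on the CPU. The helper apply_gradients() is hypothetical. Note that each process sees only the parameters of its own stage, so the clipping norm here is computed per stage rather than over the whole model, which is an assumption of this sketch:

```python
import torch
import torch.nn as nn


def apply_gradients(model, optimizer, scheduler, max_norm: float = 1.0) -> None:
    """Clip gradients, update the weights, advance the learning rate, and clear gradients."""
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()


# Toy usage: in the pipeline setting, schedule.step() would populate the gradients instead
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10)
before = model.weight.detach().clone()
model(torch.ones(3, 4)).sum().backward()
apply_gradients(model, optimizer, scheduler)
```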

Since multiple processes will be running concurrently, you want to keep your output clean. Therefore, the tqdm progress bar is displayed only on the last stage, where you can collect the loss metric and print it. Note that cross-entropy loss is averaged per prediction by default, so the losses of the micro-batches are averaged to make the reported value comparable to single-GPU training.
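A small sketch of the averaging, assuming losses is the list of scalar tensors collected on the last stage and that all micro-batches contain the same number of predictions (otherwise a weighted average would be needed). The function name is made up for this illustration:

```python
import torch


def mean_microbatch_loss(losses) -> float:
    """Average the per-micro-batch losses into one number for reporting."""
    return torch.stack([loss.detach() for loss in losses]).mean().item()


# Toy values standing in for the losses of three micro-batches
reported = mean_microbatch_loss([torch.tensor(1.0), torch.tensor(2.0), torch.tensor(3.0)])
```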

Distributed Checkpointing

Pipeline parallelism is unique in that no process contains the full model. Therefore, you cannot use model.state_dict() to get the model weights and save them with torch.save().

Saving the model with pipeline parallelism is tricky: you need to ensure that all processes save at the same training step, so that no stage is saved with weights from a different optimizer step than the others. You also want to avoid reassembling the full model in any process to maintain speed.

In PyTorch, you need to use the distributed checkpointing API for this purpose. You typically save both the model and optimizer state together since they are tightly coupled. Below is a save function:

Before you save, call dist.barrier() to synchronize all processes. After you save, call dist.barrier() again to ensure all save operations are complete before resuming training, so that no process updates its weights while another is still saving.

Unlike torch.save(), you do not save to a single file. Instead, each process saves to a different file based on its rank. You also do not use model.state_dict() for this purpose. The save() function takes a checkpoint ID, which is the directory name to use. The file created by each process will be named __3_0.distcp for rank 3, for example. This is not in the same format as files created by torch.save().

To restore the model, you use a similar workflow:

The load() function is similar to save(): you need to pass a checkpoint ID and a dictionary of states. Unlike torch.load(), which returns a state dictionary, this function loads the checkpoint in-place into the dictionary you pass. Therefore, you first need to call get_state_dict() to obtain the model and optimizer state dictionaries that load() will fill.

Since load() updates the weights in-place, you need to call it with the correct arguments and fence it with dist.barrier() to ensure all processes are synchronized. However, some models may override the load_state_dict() method to perform additional operations. To be safe, you can call set_state_dict() as shown above to trigger the load_state_dict() method on both the model and optimizer. This does no harm if in-place weight updates are sufficient.

Also note that if you have other objects not managed by the pipeline, such as the learning rate scheduler, you still need to use torch.save() and torch.load() to save and restore them.
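A sketch of a round trip for a learning rate scheduler on the CPU. The helper names and the file name scheduler.pt are made up. In a multi-process run, every process holds an identical scheduler, so saving it from a single rank is an assumption that usually suffices:

```python
import os
import tempfile

import torch


def save_scheduler(scheduler, path: str) -> None:
    """Save the scheduler state with the regular torch.save()."""
    torch.save(scheduler.state_dict(), path)


def load_scheduler(scheduler, path: str) -> None:
    """Restore the scheduler state in place from a file written by save_scheduler()."""
    scheduler.load_state_dict(torch.load(path))


# Toy round trip: advance a scheduler by three steps, save it, and restore a fresh one
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.SGD([param], lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10)
for _ in range(3):
    optimizer.step()
    scheduler.step()

path = os.path.join(tempfile.mkdtemp(), "scheduler.pt")
save_scheduler(scheduler, path)

restored = torch.optim.lr_scheduler.StepLR(torch.optim.SGD([param], lr=0.1), step_size=10)
load_scheduler(restored, path)
```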

That’s all that’s needed to run model training with pipeline parallelism. For completeness, below is the full code:

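Be sure to run this script with the torchrun command. For example, on a single computer with 3 GPUs, pass the options --standalone --nproc_per_node=3 to torchrun, followed by the name of the script, so that three processes are launched.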

If you need to run it on multiple machines, you should use the commands:

Limitations of Pipeline Parallelism

Comparing the model code from the previous post and the code above, you can see that the model no longer takes the attention mask as input. Instead, the attention function in the class LlamaAttention is called with is_causal=True to create a causal attention mask internally.
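To illustrate what is_causal=True does, the sketch below compares it with an explicit lower-triangular mask. It assumes the attention is computed with torch.nn.functional.scaled_dot_product_attention(), which may or may not match the exact call in LlamaAttention, and the tensor shapes are arbitrary toy values:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
q = torch.randn(1, 2, 5, 8)  # (batch, heads, sequence length, head dimension)
k = torch.randn(1, 2, 5, 8)
v = torch.randn(1, 2, 5, 8)

# Explicit causal mask: True means the key position may be attended to
causal_mask = torch.tril(torch.ones(5, 5, dtype=torch.bool))
out_mask = F.scaled_dot_product_attention(q, k, v, attn_mask=causal_mask)

# Same computation, with the causal mask created internally
out_flag = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print(torch.allclose(out_mask, out_flag, atol=1e-5))
```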

Numerically, these two implementations are equivalent, as the training loss ignores the padding tokens. However, without the padding mask, you spend more time computing attention weights that are not used.

This modification is necessary to use pipeline parallelism, as the pipeline schedule does not work well when the model takes two arguments in the forward pass. This may improve in the future, as the PyTorch pipeline-parallelism API is still experimental.

Summary

In this article, you learned about pipeline parallelism and how to use it in PyTorch. Specifically, you learned:

  • Pipeline parallelism is a technique to train a model on multiple GPUs by splitting the model into multiple stages.
  • The pipeline schedule coordinates the pipeline’s stages.
  • Distributed checkpointing is used to save and restore the model weights and optimizer state in a distributed environment, since you no longer have a single process with access to the full model.
  • There are limitations in the current PyTorch pipeline-parallelism API. Your model may require modifications to support pipeline parallelism.
