Train Your Large Model on Multiple GPUs with Pipeline Parallelism

By Adrian Tam on January 24, 2026 in Training Transformer Models

Some language models are too large to train on a single GPU. When the model fits on a single GPU but cannot be trained with a large batch size, you can use data parallelism. However, when the model is too large to fit on a single GPU, you need to split it across multiple GPUs. In this article, you will learn how to use pipeline parallelism to split models for training. In particular, you will learn about:

What is pipeline parallelism
How to use pipeline parallelism in PyTorch
How to save and restore the model with pipeline parallelism

Let’s get started!

Train Your Large Model on Multiple GPUs with Pipeline Parallelism.
Photo by Ivan Ivankovic. Some rights reserved.

Overview

This article is divided into six parts; they are:

Pipeline Parallelism Overview
Model Preparation for Pipeline Parallelism
Stage and Pipeline Schedule
Training Loop
Distributed Checkpointing
Limitations of Pipeline Parallelism

Pipeline Parallelism Overview

Pipeline parallelism means creating the model as a pipeline of stages. If you have worked on a scikit-learn project, you may be familiar with the concept of a pipeline. An example of a scikit-learn pipeline is:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])

When you pass data to this pipeline, it is processed by the first stage (StandardScaler), and the output is passed to the second stage (LogisticRegression).

A transformer model is typically just a stack of transformer blocks. Each block takes one tensor as input and produces one tensor as output. This makes it a perfect candidate for a pipeline: each stage is a transformer block, and the blocks are chained together. Executing the pipeline is mathematically equivalent to executing the model.

With a transformer model, it is straightforward to create a pipeline manually. At a high level, all you need to do is the following:

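import torch

# TransformerBlock is a placeholder for your own nn.Module; one block is placed on each GPU
stage1 = TransformerBlock().to("cuda:0")
stage2 = TransformerBlock().to("cuda:1")
stage3 = TransformerBlock().to("cuda:2")

batch_size, seq_length, hidden_size = 4, 512, 768
input_tensor = torch.randn(batch_size, seq_length, hidden_size).to("cuda:0")

# Run the stages in sequence, moving the intermediate tensor to the next GPU each time
output1 = stage1(input_tensor)
output2 = stage2(output1.to("cuda:1"))
output3 = stage3(output2.to("cuda:2"))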

However, this method is not efficient. While the stage1 model runs on GPU 0, GPUs 1 and 2 are idle. Only after stage1 finishes and the tensor output1 is ready can stage2 start on GPU 1, and so on.

In PyTorch, there is infrastructure for managing the pipeline to keep all GPUs busy. This is based on the concept of micro-batches: instead of processing a batch of size $N$, you split the batch into $n$ micro-batches of size $N/n$ each. When stage2 processes the $i$-th micro-batch, stage1 can process the $(i+1)$-th micro-batch. Once all micro-batches are processed, aggregate the results to produce the final output.
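To make the overlap concrete, here is a minimal sketch in plain Python (it is not PyTorch's scheduler) that lists which micro-batch each stage works on at each step of the forward pass. It assumes every stage takes exactly one time step per micro-batch; the function name forward_timeline is hypothetical.

```python
from typing import List, Optional


def forward_timeline(num_stages: int, num_microbatches: int) -> List[List[Optional[int]]]:
    """Return timeline[t][s]: the micro-batch that stage s processes at step t, or None if idle."""
    total_steps = num_stages + num_microbatches - 1
    timeline = []
    for t in range(total_steps):
        row = []
        for s in range(num_stages):
            mb = t - s  # stage s can start micro-batch 0 only at step s
            row.append(mb if 0 <= mb < num_microbatches else None)
        timeline.append(row)
    return timeline


# Three stages and four micro-batches, as in the three-GPU example above
for t, row in enumerate(forward_timeline(num_stages=3, num_microbatches=4)):
    cells = ["idle" if mb is None else f"mb{mb}" for mb in row]
    print(f"step {t}: " + " | ".join(cells))
```

In this idealized timeline, all three stages are busy at the same time from step 2 onward, until the first stage runs out of micro-batches.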

Let’s see how you can implement a training script for pipeline parallelism in PyTorch.

Warning: The PyTorch pipeline parallelism API is experimental and may change in the future. The code in this article was tested on PyTorch 2.9.1. Running the code on a different PyTorch version may not work.

Model Preparation for Pipeline Parallelism

If your model can fit on a single GPU, distributed data parallel is preferable. When you need pipeline parallelism, your model is likely too large to fit on a single device.

Before you set up the pipeline, you need to create your model first. You have two options: either create the model for one stage so it fits on your GPU, or create the full model on a fake device and then trim it before transferring it to an actual GPU. The former requires defining your model with a stage argument in its constructor so that a particular stage can be created. For the latter, you can do the following:

...
with torch.device("meta"):
    model_config = LlamaConfig()
    model = LlamaForPretraining(model_config, stage=rank)
    # Partition the model by removing some layers
    num_layers = model_config.num_hidden_layers
    partition = [num_layers // 3, 2 * num_layers // 3, num_layers]
    if rank == 0:
        # from embedding to 1/3 of the decoder layers
        for n in range(partition[0], partition[2]):
            model.base_model.layers[str(n)] = None
        model.base_model.norm = None
        model.lm_head = None
    elif rank == 1:
        # from 1/3 to 2/3 of the decoder layers
        model.base_model.embed_tokens = None
        for n in range(0, partition[0]):
            model.base_model.layers[str(n)] = None
        for n in range(partition[1], partition[2]):
            model.base_model.layers[str(n)] = None
        model.base_model.norm = None
        model.lm_head = None
    elif rank == 2:
        # from 2/3 to the end of the decoder layers and the final norm layer, LM head
        model.base_model.embed_tokens = None
        for n in range(partition[1]):
            model.base_model.layers[str(n)] = None
    else:
        raise ValueError(f"Invalid rank: {rank}")

The model is created using the class LlamaForPretraining defined in the previous post. If the model is too large, instantiating it would cause an out-of-memory error. Here, you create the model on a fake device meta. When a model is created on meta, the weights are not allocated.

In the code above, you partition the model into three stages: at rank 0 (the first stage), the model keeps the embedding layer and the first 1/3 of the decoder layers. At rank 1 (the second stage), the model keeps only the middle 1/3 of the decoder layers. At rank 2 (the third stage), the model keeps the last 1/3 of the decoder layers, the final normalization layer, and the prediction head. Components not needed in a particular stage are set to None. These stages have no overlap and tightly partition the model.
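The layer boundaries used above can be checked with a small sketch in plain Python. The helper stage_layer_range is hypothetical and only generalizes the same integer-division boundaries to an arbitrary number of stages; it does not touch the model.

```python
def stage_layer_range(num_layers: int, num_stages: int, stage: int) -> range:
    """Decoder layers kept by `stage`, with boundaries at k * num_layers // num_stages."""
    if not 0 <= stage < num_stages:
        raise ValueError(f"Invalid stage: {stage}")
    start = stage * num_layers // num_stages
    end = (stage + 1) * num_layers // num_stages
    return range(start, end)


num_layers, num_stages = 12, 3
for stage in range(num_stages):
    kept = stage_layer_range(num_layers, num_stages, stage)
    print(f"stage {stage}: keeps layers {list(kept)}")
```

When the number of layers is not divisible by the number of stages, these boundaries still cover every layer exactly once, but the stages hold slightly different numbers of layers.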

To make such a model work, you need to modify the model code so that when a component is None, it is skipped in the forward pass. This needs to be done in the classes LlamaModel and LlamaForPretraining:

...
class LlamaModel(nn.Module):
    """The full Llama model without any pretraining heads."""

    def __init__(self, config: LlamaConfig) -> None:
        super().__init__()
        self.rope = RotaryPositionEncoding(
            config.hidden_size // config.num_attention_heads,
            config.max_position_embeddings,
        )
        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size)
        self.layers = nn.ModuleDict({
            str(i): LlamaDecoderLayer(config) for i in range(config.num_hidden_layers)
        })
        self.norm = nn.RMSNorm(config.hidden_size, eps=1e-5)

    def forward(self, input_ids: Tensor) -> Tensor:
        # Convert input token IDs to embeddings
        if self.embed_tokens is not None:
            hidden_states = self.embed_tokens(input_ids)
        else:
            hidden_states = input_ids
        # Process through all transformer layers, then the final norm layer
        for n in range(len(self.layers)):
            if self.layers[str(n)] is not None:
                hidden_states = self.layers[str(n)](hidden_states, self.rope)
        if self.norm is not None:
            hidden_states = self.norm(hidden_states)
        # Return the final hidden states
        return hidden_states


class LlamaForPretraining(nn.Module):
    def __init__(self, config: LlamaConfig, stage) -> None:
        super().__init__()
        self.base_model = LlamaModel(config)
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
        self.stage = stage

    def forward(self, input_ids: Tensor) -> Tensor:
        hidden_states = self.base_model(input_ids)
        if self.lm_head is not None:
            hidden_states = self.lm_head(hidden_states)
        return hidden_states

You can see that several if-statements are added to check if the component is None before allowing it to process the hidden_states tensor.

After you create the partial model, you need to transfer it to the actual GPU. Transferring a model from the meta device to a real GPU device is done using the method to_empty(), not to(), as you need to allocate the weight tensors during the transfer:

...
def reset_all_weights(model: nn.Module) -> None:
    @torch.no_grad()
    def weight_reset(m: nn.Module):
        reset_parameters = getattr(m, "reset_parameters", None)
        if callable(reset_parameters):
            m.reset_parameters()

    # Applies fn recursively to model itself and all of model.children()
    model.apply(fn=weight_reset)

model.to_empty(device=device)
reset_all_weights(model)

The function reset_all_weights() calls the reset_parameters() method on every model component that defines one. This applies each module's default initialization, such as uniformly distributed random values (Kaiming initialization) in nn.Linear modules, normally distributed values in nn.Embedding modules, and all ones in nn.RMSNorm modules.
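The meta-device workflow can be tried on a small layer without any GPU. The following is a minimal sketch that uses the CPU as the real device; the layer size is arbitrary and chosen only for illustration.

```python
import torch
import torch.nn as nn

# Create the layer on the meta device: shapes are known but no memory is allocated
with torch.device("meta"):
    layer = nn.Linear(4, 2, bias=False)
print(layer.weight.is_meta)  # True

# Allocate uninitialized storage on a real device, then initialize the weights
layer.to_empty(device="cpu")
layer.reset_parameters()
print(layer.weight.is_meta, layer.weight.device)
```

Note that to_empty() leaves all storage uninitialized, and only modules with a reset_parameters() method are handled by this approach. Any tensor computed in a constructor, such as a buffer registered with register_buffer(), would likely need to be recomputed separately after the transfer.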

Stage and Pipeline Schedule

In PyTorch, pipeline parallelism should be executed with the torchrun command rather than running the script directly. This means multiple processes will be launched, each handling a stage of the pipeline.

When you write a script for torchrun, remember that multiple processes will execute the same script, and each process should operate only on its own scope of work. In pipeline parallelism, this means:

The script should create only one stage of the model
The script should set up a pipeline to allow communication between stages

The key is to use the process group in the torch.distributed module. When torchrun launches multiple processes, the total number of processes is called the world size. Each process has a unique rank. If you run these processes across multiple computers on a network, each process may be assigned a particular GPU device on a machine. The local rank identifies the device ID.
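torchrun passes these values to each process through the environment variables RANK, LOCAL_RANK, and WORLD_SIZE. Below is a minimal sketch of a helper that reads them and falls back to single-process values when the script is not launched by torchrun. The names DistInfo and read_dist_env are hypothetical.

```python
import os
from typing import Mapping, NamedTuple


class DistInfo(NamedTuple):
    rank: int        # unique ID of this process across all machines
    local_rank: int  # ID of this process (and its GPU) within its own machine
    world_size: int  # total number of processes


def read_dist_env(env: Mapping[str, str] = os.environ) -> DistInfo:
    """Read the variables set by torchrun, defaulting to a single process."""
    return DistInfo(
        rank=int(env.get("RANK", 0)),
        local_rank=int(env.get("LOCAL_RANK", 0)),
        world_size=int(env.get("WORLD_SIZE", 1)),
    )


# Example: the second process on the second of two 2-GPU machines
info = read_dist_env({"RANK": "3", "LOCAL_RANK": "1", "WORLD_SIZE": "4"})
print(info)
```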

As with distributed data parallel, you should initialize the distributed environment before you set up the pipeline:

import os

import torch
import torch.distributed as dist

# torchrun provides the rank and world size through environment variables
dist.init_process_group(backend="nccl")
rank = dist.get_rank()
local_rank = int(os.environ["LOCAL_RANK"])
world_size = dist.get_world_size()
device = torch.device(f"cuda:{local_rank}")

Then, you can create the stage object. It specifies which stage your model belongs to, which device it should run on, and how many stages there are in total:

...
from torch.distributed.pipelining import PipelineStage, ScheduleGPipe

stage = PipelineStage(model, stage_index=rank, num_stages=world_size, device=device)

Now that you have set up the pipeline stage, you still need to specify how micro-batches flow through it. PyTorch offers multiple algorithms for this, called schedules. A simple and commonly used one is ScheduleGPipe:

...
def loss_fn(logits, target_ids):
    logits = logits.view(-1, logits.size(-1))
    target_ids = target_ids.view(-1)
    loss = F.cross_entropy(logits, target_ids, ignore_index=PAD_TOKEN_ID)
    return loss

n_microbatches = 4  # num split per batch
schedule = ScheduleGPipe(stage, n_microbatches=n_microbatches, loss_fn=loss_fn)

As mentioned above, the transformer model you used is a stack of transformer blocks, each of which takes one tensor as input and produces one tensor as output. In pipeline parallelism, you do not explicitly run the model’s forward and backward passes; instead, you use the pipeline schedule to coordinate the stages.

Recall that the backward pass uses the output from the forward pass to compute the loss metric, then propagates the gradient back to the model parameters based on the loss. For the pipeline schedule to know how to trigger the backward pass, you need to implement a loss function, such as loss_fn() above.
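To see what this loss function computes on its own, here is a minimal sketch that runs it on random tensors on the CPU, outside any pipeline. The value of PAD_TOKEN_ID and the tensor sizes are assumptions made only for this illustration.

```python
import torch
import torch.nn.functional as F

PAD_TOKEN_ID = 0  # assumed padding ID for this illustration


def loss_fn(logits, target_ids):
    logits = logits.view(-1, logits.size(-1))  # (batch * seq_len, vocab_size)
    target_ids = target_ids.view(-1)           # (batch * seq_len,)
    return F.cross_entropy(logits, target_ids, ignore_index=PAD_TOKEN_ID)


batch_size, seq_len, vocab_size = 2, 5, 11
logits = torch.randn(batch_size, seq_len, vocab_size, requires_grad=True)
target_ids = torch.randint(1, vocab_size, (batch_size, seq_len))
target_ids[:, -1] = PAD_TOKEN_ID  # pretend the last position is padding

loss = loss_fn(logits, target_ids)
loss.backward()
# The loss is a scalar, and padded positions receive zero gradient
print(loss.dim(), logits.grad[:, -1].abs().sum().item())
```

The loss is a single scalar averaged over the non-padding positions, which is what the schedule needs to start the backward pass for each micro-batch.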

The n_microbatches argument specifies how many micro-batches each batch is split into. When you use pipeline parallelism, PyTorch expects a batched tensor as input to the pipeline schedule, which is then split into micro-batches that are fed into the pipeline stages one after another.

Micro-batches are key to keeping all GPUs busy, as each stage can process a different micro-batch in parallel. Once all micro-batches are processed, you aggregate the results to get the final output and perform gradient updates. This completes one training step; you then proceed to the next batch.

Not all GPUs are busy at all times. The number of idle GPUs and the duration of idle time are collectively referred to as the bubble. Pipeline scheduling algorithms vary in how they minimize bubble formation, which is critical to the efficiency of pipeline parallelism.

Bubbles in pipeline parallelism: The numbered boxes represent micro-batches processed by the devices; typically, the backward pass takes at least twice as long as the forward pass. The grey area means the devices are idle. The illustration is from Fig. 3 of Narayanan et al. (2021).
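For a GPipe-style schedule, the size of the bubble can be estimated with a short calculation. The sketch below assumes an idealized case in which every micro-batch takes the same time at every stage and communication is free: each device is then idle for p - 1 of the m + p - 1 time slots, where p is the number of stages and m is the number of micro-batches. The function name gpipe_bubble_fraction is hypothetical.

```python
from fractions import Fraction


def gpipe_bubble_fraction(num_stages: int, num_microbatches: int) -> Fraction:
    """Idle fraction of each device in an idealized GPipe schedule."""
    return Fraction(num_stages - 1, num_microbatches + num_stages - 1)


# With three stages, more micro-batches per batch means a smaller bubble
for m in (1, 4, 16, 64):
    frac = gpipe_bubble_fraction(num_stages=3, num_microbatches=m)
    print(f"{m:3d} micro-batches: {float(frac):.1%} idle")
```

Under these assumptions, three stages with four micro-batches leave each GPU idle for one third of the time, which is why a larger number of micro-batches is usually preferred when the batch size allows it.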

Training Loop

Once you have instantiated the partial model, created the pipeline stage object, and configured the schedule, the data loader, optimizer, and learning rate scheduler are the same as in single-GPU training.

However, in the training loop, you should use the pipeline schedule for the forward and backward passes. You should not call the model or compute the loss metric directly. Moreover, each stage of the pipeline works differently in the training loop. Below is how you should modify the training loop for pipeline parallelism:

...
for epoch in range(epochs):
    pbar = tqdm.tqdm(dataloader, desc=f"Epoch {epoch+1}/{epochs}", disable=(rank != world_size - 1))
    for batch_id, batch in enumerate(pbar):
        # zero grad before forward pass, since no explicit backward pass is called
        optimizer.zero_grad(set_to_none=True)
        # get batched data and run the pipeline
        input_ids, target_ids = batch
        if rank == 0:
            schedule.step(input_ids)
        elif rank == world_size - 1:
            losses = []  # expects one loss per microbatch
            logits = schedule.step(target=target_ids, losses=losses)
            with torch.no_grad():
                pbar.set_postfix(loss=sum(losses).item() / len(losses))
        else:
            schedule.step()
        # gradient update through optimizer
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()
        # no pbar.update() here: iterating over the tqdm object already advances the bar
    pbar.close()

You create the model object but never call it directly in the training loop. Instead, you pass the input tensor input_ids to the pipeline schedule if you are at rank 0. This is how you send the input to the first stage of the pipeline. For the remaining stages, call schedule.step() to have the pipeline process the output from the previous stage. In the final stage, you expect the model to produce its output. You provide the target tensor target_ids to signal that the loss function should be called to compute the loss metric and trigger the backward pass. The loss metric is not used explicitly in the training loop, as the pipeline schedule handles it internally. However, you can provide a Python list in the losses argument to store the loss metrics for each micro-batch.

After the model completes its forward and backward passes, the gradient is computed and stored with the model. You can then perform the usual gradient update processes, including gradient clipping, optimizer step, and learning rate scheduler update.

Since multiple processes will be running concurrently, you want to keep your output clean. Therefore, the tqdm progress bar is displayed only on the last stage, where you can collect the loss metric and print it. Note that cross-entropy loss is averaged per prediction by default, so it is averaged across all micro-batches to make it comparable to single-GPU training.
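The averaging can be illustrated with plain numbers. The following sketch uses made-up per-token loss values and shows that the mean of the micro-batch means matches the full-batch mean when every micro-batch contains the same number of counted tokens, but not otherwise (for example, when padding is ignored unevenly).

```python
def mean(values):
    return sum(values) / len(values)


# Hypothetical per-token losses for one batch
token_losses = [2.0, 4.0, 6.0, 8.0, 1.0, 3.0, 5.0, 7.0]
full_batch = mean(token_losses)

# Four micro-batches with two tokens each: the mean of means equals the full mean
micro = [token_losses[i:i + 2] for i in range(0, len(token_losses), 2)]
averaged = mean([mean(mb) for mb in micro])
print(full_batch, averaged)

# Uneven token counts per micro-batch: the mean of means drifts from the full mean
uneven = [token_losses[:1], token_losses[1:]]
uneven_averaged = mean([mean(mb) for mb in uneven])
print(uneven_averaged)
```

The value shown in the progress bar is therefore best read as a close approximation of the single-GPU loss, not an exact match, whenever padding is distributed unevenly across micro-batches.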

Distributed Checkpointing

Pipeline parallelism is unique in that no process contains the full model. Therefore, you cannot use model.state_dict() to get the model weights and save them with torch.save().

Saving the model with pipeline parallelism is tricky: you need to ensure all processes save at the same training step, so that no process has applied a weight update that another has not. You also want to avoid reassembling the full model in any single process, which would cost both time and memory.

In PyTorch, you need to use the distributed checkpointing API for this purpose. You typically save both the model and optimizer state together since they are tightly coupled. Below is a save function:

...
from torch.distributed.checkpoint import load, save
from torch.distributed.checkpoint.state_dict import get_state_dict, set_state_dict, StateDictOptions

def save_checkpoint(model, optimizer):
    dist.barrier()
    model_state, optimizer_state = get_state_dict(
        model, optimizer, options=StateDictOptions(full_state_dict=True)
    )
    save(
        {"model": model_state, "optimizer": optimizer_state},
        checkpoint_id="checkpoint-dist",  # each rank will save its own file
    )
    dist.barrier()

Before you save, call dist.barrier() to synchronize all processes. After you save, call dist.barrier() again to ensure all save operations are complete before resuming training, so that no process updates its weights while another is still writing its checkpoint.

Unlike torch.save(), you do not save to a single file. Instead, each process saves to a different file based on its rank. You also do not use model.state_dict() for this purpose. The save() function takes a checkpoint ID, which is the directory name to use. The file created by each process will be named __3_0.distcp for rank 3, for example. This is not in the same format as files created by torch.save().

To restore the model, you use a similar workflow:

...
def load_checkpoint(model, optimizer):
    dist.barrier()
    model_state, optimizer_state = get_state_dict(
        model, optimizer, options=StateDictOptions(full_state_dict=True)
    )
    load(
        {"model": model_state, "optimizer": optimizer_state},
        checkpoint_id="checkpoint-dist"  # each rank will read its own file
    )
    # necessary if model.load_state_dict() should be called
    set_state_dict(
        model, optimizer,
        model_state_dict=model_state, optim_state_dict=optimizer_state,
        options=StateDictOptions(broadcast_from_rank0=True, full_state_dict=True)
    )
    dist.barrier()

The load() function is similar to save(): you need to pass a checkpoint ID and a dictionary of states. Unlike torch.load(), which returns a state dictionary, this method loads the checkpoint in-place. Therefore, using get_state_dict() to retrieve the model and optimizer weights and states is necessary.

Since load() updates the weights in-place, you need to call it with the correct arguments and fence it with dist.barrier() to ensure all processes are synchronized. However, some models may override the load_state_dict() method to perform additional operations. To be safe, you can call set_state_dict() as shown above to trigger the load_state_dict() method on both the model and optimizer. This is harmless when the in-place weight updates are already sufficient.

Also note that if you have other objects not managed by the pipeline, such as the learning rate scheduler, you still need to use torch.save() and torch.load() to save and restore them.
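As an illustration, the state of a learning rate scheduler can be saved and restored as in the sketch below. It runs on the CPU with a tiny stand-in model; the StepLR scheduler and the file name scheduler.pt are assumptions for this example and are not taken from the training script. In a distributed run, you would typically write such a file from one rank only.

```python
import os
import tempfile

import torch

# Tiny stand-in model, optimizer, and scheduler
model = torch.nn.Linear(2, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.5)

optimizer.step()
scheduler.step()  # the learning rate is halved after one scheduler step

# Save only the scheduler state, which holds plain Python values
path = os.path.join(tempfile.mkdtemp(), "scheduler.pt")
torch.save(scheduler.state_dict(), path)

# Restore the state into a freshly created scheduler
new_optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
new_scheduler = torch.optim.lr_scheduler.StepLR(new_optimizer, step_size=1, gamma=0.5)
new_scheduler.load_state_dict(torch.load(path))
print(new_scheduler.last_epoch, new_scheduler.get_last_lr())
```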

That’s all that’s needed to run model training with pipeline parallelism. For completeness, below is the full code:

import dataclasses
import os

import datasets
import tokenizers
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F
import torch.optim.lr_scheduler as lr_scheduler
import tqdm
from torch import Tensor
from torch.distributed.checkpoint import load, save
from torch.distributed.checkpoint.state_dict import StateDictOptions, get_state_dict, set_state_dict
from torch.distributed.pipelining import PipelineStage, ScheduleGPipe


# Build the model
@dataclasses.dataclass
class LlamaConfig:
    """Define Llama model hyperparameters."""
    vocab_size: int = 50000  # Size of the tokenizer vocabulary
    max_position_embeddings: int = 2048  # Maximum sequence length
    hidden_size: int = 768  # Dimension of hidden layers
    intermediate_size: int = 4*768  # Dimension of MLP's hidden layer
    num_hidden_layers: int = 12  # Number of transformer layers
    num_attention_heads: int = 12  # Number of attention heads
    num_key_value_heads: int = 3  # Number of key-value heads for GQA


class RotaryPositionEncoding(nn.Module):
    """Rotary position encoding."""

    def __init__(self, dim: int, max_position_embeddings: int) -> None:
        """Initialize the RotaryPositionEncoding module.

        Args:
            dim: The hidden dimension of the input tensor to which RoPE is applied
            max_position_embeddings: The maximum sequence length of the input tensor
        """
        super().__init__()
        self.dim = dim
        self.max_position_embeddings = max_position_embeddings
        # compute a matrix of n\theta_i
        N = 10_000.0
        inv_freq = 1.0 / (N ** (torch.arange(0, dim, 2) / dim))
        inv_freq = torch.cat((inv_freq, inv_freq), dim=-1)
        position = torch.arange(max_position_embeddings)
        sinusoid_inp = torch.outer(position, inv_freq)
        # save cosine and sine matrices as buffers, not parameters
        self.register_buffer("cos", sinusoid_inp.cos())
        self.register_buffer("sin", sinusoid_inp.sin())

    def forward(self, x: Tensor) -> Tensor:
        """Apply RoPE to tensor x.

        Args:
            x: Input tensor of shape (batch_size, seq_length, num_heads, head_dim)

        Returns:
            Output tensor of shape (batch_size, seq_length, num_heads, head_dim)
        """
        batch_size, seq_len, num_heads, head_dim = x.shape
        dtype = x.dtype
        # transform the cosine and sine matrices to 4D tensor and the same dtype as x
        cos = self.cos.to(dtype)[:seq_len].view(1, seq_len, 1, -1)
        sin = self.sin.to(dtype)[:seq_len].view(1, seq_len, 1, -1)
        # apply RoPE to x
        x1, x2 = x.chunk(2, dim=-1)
        rotated = torch.cat((-x2, x1), dim=-1)
        output = (x * cos) + (rotated * sin)
        return output


class LlamaAttention(nn.Module):
    """Grouped-query attention with rotary embeddings."""

    def __init__(self, config: LlamaConfig) -> None:
        super().__init__()
        self.hidden_size = config.hidden_size
        self.num_heads = config.num_attention_heads
        self.head_dim = self.hidden_size // self.num_heads
        self.num_kv_heads = config.num_key_value_heads  # GQA: H_kv < H_q

        # hidden_size must be divisible by num_heads
        assert (self.head_dim * self.num_heads) == self.hidden_size

        # Linear layers for Q, K, V projections
        self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(self.hidden_size, self.num_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(self.hidden_size, self.num_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False)

    def forward(self, hidden_states: Tensor, rope: RotaryPositionEncoding) -> Tensor:
        bs, seq_len, dim = hidden_states.size()

        # Project inputs to Q, K, V
        query_states = self.q_proj(hidden_states).view(bs, seq_len, self.num_heads, self.head_dim)
        key_states = self.k_proj(hidden_states).view(bs, seq_len, self.num_kv_heads, self.head_dim)
        value_states = self.v_proj(hidden_states).view(bs, seq_len, self.num_kv_heads, self.head_dim)

        # Apply rotary position embeddings
        query_states = rope(query_states)
        key_states = rope(key_states)

        # Transpose tensors from BSHD to BHSD dimension for scaled_dot_product_attention
        query_states = query_states.transpose(1, 2)
        key_states = key_states.transpose(1, 2)
        value_states = value_states.transpose(1, 2)

        # Use PyTorch's optimized attention implementation
        # setting is_causal=True is incompatible with setting explicit attention mask
        attn_output = F.scaled_dot_product_attention(
            query_states,
            key_states,
            value_states,
            is_causal=True,
            dropout_p=0.0,
            enable_gqa=True,
        )

        # Transpose output tensor from BHSD to BSHD dimension, reshape to 3D, and then project output
        attn_output = attn_output.transpose(1, 2).reshape(bs, seq_len, self.hidden_size)
        attn_output = self.o_proj(attn_output)
        return attn_output


class LlamaMLP(nn.Module):
    """Feed-forward network with SwiGLU activation."""

    def __init__(self, config: LlamaConfig) -> None:
        super().__init__()
        # Two parallel projections for SwiGLU
        self.gate_proj = nn.Linear(config.hidden_size, config.intermediate_size, bias=False)
        self.up_proj = nn.Linear(config.hidden_size, config.intermediate_size, bias=False)
        self.act_fn = F.silu  # SwiGLU activation function
        # Project back to hidden size
        self.down_proj = nn.Linear(config.intermediate_size, config.hidden_size, bias=False)

    def forward(self, x: Tensor) -> Tensor:
        # SwiGLU activation: multiply gate and up-projected inputs
        gate = self.act_fn(self.gate_proj(x))
        up = self.up_proj(x)
        return self.down_proj(gate * up)


class LlamaDecoderLayer(nn.Module):
    """Single transformer layer for a Llama model."""

    def __init__(self, config: LlamaConfig) -> None:
        super().__init__()
        self.input_layernorm = nn.RMSNorm(config.hidden_size, eps=1e-5)
        self.self_attn = LlamaAttention(config)
        self.post_attention_layernorm = nn.RMSNorm(config.hidden_size, eps=1e-5)
        self.mlp = LlamaMLP(config)

    def forward(self, hidden_states: Tensor, rope: RotaryPositionEncoding) -> Tensor:
        # First residual block: Self-attention
        residual = hidden_states
        hidden_states = self.input_layernorm(hidden_states)
        attn_outputs = self.self_attn(hidden_states, rope=rope)
        hidden_states = attn_outputs + residual

        # Second residual block: MLP
        residual = hidden_states
        hidden_states = self.post_attention_layernorm(hidden_states)
        hidden_states = self.mlp(hidden_states) + residual
        return hidden_states


class LlamaModel(nn.Module):
    """The full Llama model without any pretraining heads."""

    def __init__(self, config: LlamaConfig) -> None:
        super().__init__()
        self.rope = RotaryPositionEncoding(
            config.hidden_size // config.num_attention_heads,
            config.max_position_embeddings,
        )

        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size)
        self.layers = nn.ModuleDict({
            str(i): LlamaDecoderLayer(config) for i in range(config.num_hidden_layers)
        })
        self.norm = nn.RMSNorm(config.hidden_size, eps=1e-5)

    def forward(self, input_ids: Tensor) -> Tensor:
        # Convert input token IDs to embeddings
        if self.embed_tokens is not None:
            hidden_states = self.embed_tokens(input_ids)
        else:
            hidden_states = input_ids
        # Process through all transformer layers, then the final norm layer
        for n in range(len(self.layers)):
            if self.layers[str(n)] is not None:
                hidden_states = self.layers[str(n)](hidden_states, self.rope)
        if self.norm is not None:
            hidden_states = self.norm(hidden_states)
        # Return the final hidden states (or this stage's intermediate output)
        return hidden_states


class LlamaForPretraining(nn.Module):
    def __init__(self, config: LlamaConfig) -> None:
        super().__init__()
        self.base_model = LlamaModel(config)
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)

    def forward(self, input_ids: Tensor) -> Tensor:
        hidden_states = self.base_model(input_ids)
        if self.lm_head is not None:
            hidden_states = self.lm_head(hidden_states)
        return hidden_states


# Dataset that yields padded (input, target) token sequences of fixed length
class PretrainingDataset(torch.utils.data.Dataset):
    def __init__(self, dataset: datasets.Dataset, tokenizer: tokenizers.Tokenizer,
                 seq_length: int, device: torch.device = None):
        self.dataset = dataset
        self.tokenizer = tokenizer
        self.device = device
        self.seq_length = seq_length
        self.bot = tokenizer.token_to_id("[BOT]")
        self.eot = tokenizer.token_to_id("[EOT]")
        self.pad = tokenizer.token_to_id("[PAD]")

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):
        """Get a sequence of token ids from the dataset. [BOT] and [EOT] tokens
        are added. Clipped and padded to the sequence length.
        """
        seq = self.dataset[index]["text"]
        tokens: list[int] = [self.bot] + self.tokenizer.encode(seq).ids + [self.eot]
        # pad to target sequence length
        toklen = len(tokens)
        if toklen < self.seq_length+1:
            pad_length = self.seq_length+1 - toklen
            tokens += [self.pad] * pad_length
        # return the sequence
        x = torch.tensor(tokens[:self.seq_length], dtype=torch.int64, device=self.device)
        y = torch.tensor(tokens[1:self.seq_length+1], dtype=torch.int64, device=self.device)
        return x, y


def load_checkpoint(model: nn.Module, optimizer: torch.optim.Optimizer) -> None:
    dist.barrier()
    model_state, optimizer_state = get_state_dict(
        model, optimizer, options=StateDictOptions(full_state_dict=True),
    )
    load(
        {"model": model_state, "optimizer": optimizer_state},
        checkpoint_id="checkpoint-dist",
    )
    set_state_dict(
        model, optimizer,
        model_state_dict=model_state, optim_state_dict=optimizer_state,
        options=StateDictOptions(broadcast_from_rank0=True, full_state_dict=True),
    )
    dist.barrier()


def save_checkpoint(model: nn.Module, optimizer: torch.optim.Optimizer) -> None:
    dist.barrier()
    model_state, optimizer_state = get_state_dict(
        model, optimizer, options=StateDictOptions(full_state_dict=True),
    )
    save(
        {"model": model_state, "optimizer": optimizer_state},
        checkpoint_id="checkpoint-dist",
    )
    dist.barrier()


# Load the tokenizer and dataset
tokenizer = tokenizers.Tokenizer.from_file("bpe_50K.json")
dataset = datasets.load_dataset("HuggingFaceFW/fineweb", "sample-10BT", split="train")

# Initialize the distributed environment
dist.init_process_group(backend="nccl")
rank = dist.get_rank()
local_rank = int(os.environ["LOCAL_RANK"])
world_size = dist.get_world_size()
device = torch.device(f"cuda:{local_rank}")
print(f"World size {world_size}, rank {rank}, local rank {local_rank}. Using {device}")
assert world_size == 3, f"This script is designed for 3 GPUs, got {world_size}"

# Create pretraining model with default config on meta device to prevent OOM
with torch.device("meta"):
    model_config = LlamaConfig()
    model = LlamaForPretraining(model_config)
    # Partition the model by removing some layers
    num_layers = model_config.num_hidden_layers
    partition = [num_layers // 3, 2 * num_layers // 3, num_layers]
    if rank == 0:
        # from embedding to 1/3 of the decoder layers
        for n in range(partition[0], partition[2]):
            model.base_model.layers[str(n)] = None
        model.base_model.norm = None
        model.lm_head = None
    elif rank == 1:
        # from 1/3 to 2/3 of the decoder layers
        model.base_model.embed_tokens = None
        for n in range(0, partition[0]):
            model.base_model.layers[str(n)] = None
        for n in range(partition[1], partition[2]):
            model.base_model.layers[str(n)] = None
        model.base_model.norm = None
        model.lm_head = None
    elif rank == 2:
        # from 2/3 to the end of the decoder layers and the final norm layer, LM head
        model.base_model.embed_tokens = None
        for n in range(partition[1]):
            model.base_model.layers[str(n)] = None
    else:
        raise ValueError(f"Invalid rank: {rank}")


# Move model from meta device to CUDA device, then initialize the weights
def reset_all_weights(model: nn.Module) -> None:
    @torch.no_grad()
    def weight_reset(m: nn.Module):
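        reset_parameters = getattr(m, "reset_parameters", None)
        if callable(reset_parameters):
            m.reset_parameters()
        elif isinstance(m, RotaryPositionEncoding):
            # to_empty() leaves buffers uninitialized and this module has no
            # reset_parameters(), so recompute the cos/sin tables off the meta device
            fresh = RotaryPositionEncoding(m.dim, m.max_position_embeddings)
            m.cos.copy_(fresh.cos)
            m.sin.copy_(fresh.sin)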

    # Applies fn recursively to model itself and all of model.children()
    model.apply(fn=weight_reset)


model.to_empty(device=device)
reset_all_weights(model)
model.train()
stage = PipelineStage(model, stage_index=rank, num_stages=world_size, device=device)

# Training parameters
epochs = 3
learning_rate = 1e-3
batch_size = 64
seq_length = 512
num_warmup_steps = 1000
PAD_TOKEN_ID = tokenizer.token_to_id("[PAD]")

# DataLoader, optimizer, scheduler, and loss function
dataset = PretrainingDataset(dataset, tokenizer, seq_length, device)
dataloader = torch.utils.data.DataLoader(
    dataset,
    batch_size=batch_size,
)
num_training_steps = len(dataloader) * epochs
print(f"Number of training steps: {num_training_steps} = {len(dataloader)} * {epochs}")

optimizer = torch.optim.AdamW(
    model.parameters(), lr=learning_rate, betas=(0.9, 0.99), eps=1e-8, weight_decay=0.1,
)
warmup_scheduler = lr_scheduler.LinearLR(
    optimizer,
    start_factor=0.1, end_factor=1.0, total_iters=num_warmup_steps,
)
cosine_scheduler = lr_scheduler.CosineAnnealingLR(
    optimizer,
    T_max=num_training_steps - num_warmup_steps,
    eta_min=0,
)
scheduler = lr_scheduler.SequentialLR(
    optimizer,
    schedulers=[warmup_scheduler, cosine_scheduler],
    milestones=[num_warmup_steps],
)

# if checkpoint-dist dir exists, load the checkpoint to model and optimizer
# Note: You should implement how to reset the epoch and step to allow correct resume
if os.path.exists("checkpoint-dist"):
    load_checkpoint(model, optimizer)

# Create pipeline schedule
def loss_fn(logits: Tensor, target_ids: Tensor) -> Tensor:
    logits = logits.view(-1, logits.size(-1))
    target_ids = target_ids.view(-1)
    return F.cross_entropy(logits, target_ids, ignore_index=PAD_TOKEN_ID)

n_microbatches = 4  # number of microbatches each batch is split into
schedule = ScheduleGPipe(stage, n_microbatches=n_microbatches, loss_fn=loss_fn)

# start training
for epoch in range(epochs):
    pbar = tqdm.tqdm(dataloader, desc=f"Epoch {epoch+1}/{epochs}", disable=(rank != world_size - 1))
    for batch_id, batch in enumerate(pbar):
        if batch_id % 1000 == 0:
            save_checkpoint(model, optimizer)
        # zero grad before forward pass, since no explicit backward pass is called
        optimizer.zero_grad(set_to_none=True)
        # get batched data
        input_ids, target_ids = batch
        if rank == 0:
            schedule.step(input_ids)
        elif rank == world_size - 1:
            losses = []  # expects one loss per microbatch
            logits = schedule.step(target=target_ids, losses=losses)
            with torch.no_grad():
                pbar.set_postfix(loss=sum(losses).item() / len(losses))
        else:
            schedule.step()

        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()
        # no manual pbar.update(): iterating over the tqdm object already advances the bar
    pbar.close()

# Save the model
save_checkpoint(model, optimizer)

# Clean up the distributed environment
dist.destroy_process_group()
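
To see which decoder layers each rank keeps, the boundary arithmetic from the script can be run on its own. The following is a minimal sketch in plain Python, not part of the training script: the helper name `layers_kept` is hypothetical, and it assumes the same defaults as above (12 decoder layers split across 3 stages).

```python
def layers_kept(rank: int, num_layers: int = 12, world_size: int = 3) -> list[int]:
    """Return the decoder-layer indices that a given rank keeps.

    Uses the same integer boundaries as the script's `partition` list:
    rank r keeps layers in [r * num_layers // world_size,
                            (r + 1) * num_layers // world_size).
    """
    start = rank * num_layers // world_size
    end = (rank + 1) * num_layers // world_size
    return list(range(start, end))


for rank in range(3):
    extras = []
    if rank == 0:
        extras.append("embed_tokens")       # first stage owns the embedding
    if rank == 2:
        extras += ["norm", "lm_head"]       # last stage owns the final norm and LM head
    print(f"rank {rank}: layers {layers_kept(rank)} {extras}")
```

Every layer is assigned to exactly one rank, and when the layer count is not divisible by the number of stages, the integer division makes some stages one layer larger than others.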

Be sure to run this script with the torchrun command. For example, on a single computer with 3 GPUs:

torchrun --standalone --nproc_per_node=3 training.py

If you need to run it on multiple machines, run torchrun on each machine with a different --node_rank. For example, with three machines that have one GPU each:

# run on each machine, with different node rank
torchrun --nnodes=3 --nproc_per_node=1 --node_rank=0 --master_addr=10.1.1.1 --master_port=12345 training.py
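
Since only the node rank differs between machines, the per-machine commands can be generated rather than typed by hand. The sketch below is illustrative only: the helper `torchrun_commands` is a hypothetical name, and it uses just the flags shown above. It assumes one process per machine and that the master address belongs to the machine with node rank 0.

```python
def torchrun_commands(nnodes: int, master_addr: str, master_port: int,
                      script: str = "training.py") -> list[str]:
    """Build one torchrun command per machine; only --node_rank changes."""
    return [
        f"torchrun --nnodes={nnodes} --nproc_per_node=1 --node_rank={node_rank} "
        f"--master_addr={master_addr} --master_port={master_port} {script}"
        for node_rank in range(nnodes)
    ]


# Print the command to run on each of the three machines
for cmd in torchrun_commands(3, "10.1.1.1", 12345):
    print(cmd)
```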

Limitations of Pipeline Parallelism

Comparing the model code from the previous post and the code above, you can see that the model no longer takes the attention mask as input. Instead, the attention function in the class LlamaAttention is called with is_causal=True to create a causal attention mask internally.

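Numerically, these two implementations are equivalent: the padding tokens are appended after the real tokens, so under a causal mask no real token ever attends to them, and the training loss ignores the padding positions. However, without the padding mask, you spend more time computing attention weights that are not used.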
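
This equivalence can be illustrated with a toy, pure-Python version of causal attention. It is a sketch under simplifying assumptions, not the PyTorch implementation: there is a single head, queries, keys, and values are scalars, and the function name `causal_attention` and all numbers are made up for illustration. It also assumes padding is appended after the real tokens, as PretrainingDataset does.

```python
import math


def causal_attention(q: list[float], k: list[float], v: list[float]) -> list[float]:
    """Toy single-head causal attention: position i attends only to positions 0..i."""
    out = []
    for i in range(len(q)):
        scores = [q[i] * k[j] for j in range(i + 1)]     # causal: no future positions
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]   # numerically stable softmax
        total = sum(weights)
        out.append(sum(w * v[j] for j, w in enumerate(weights)) / total)
    return out


real = [0.5, -1.0, 2.0]   # made-up scalar "hidden states" for three real tokens
pad = [9.0, 9.0]          # made-up values standing in for trailing [PAD] positions

short = causal_attention(real, real, real)
padded = causal_attention(real + pad, real + pad, real + pad)

# Outputs at the real positions are unaffected by the trailing padding
assert padded[:len(real)] == short
print("real-token outputs unchanged:", padded[:len(real)] == short)
```

The outputs at the two padded positions are still computed, which is the wasted work mentioned above. They only stop mattering because the loss function skips those positions.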

This modification is necessary to use pipeline parallelism, as the pipeline schedule does not work well when the model takes two arguments in the forward pass. This may improve in the future, as the PyTorch pipeline-parallelism API is still experimental.

Summary

In this article, you learned about pipeline parallelism and how to use it in PyTorch. Specifically, you learned:

Pipeline parallelism is a technique to train a model on multiple GPUs by splitting the model into multiple stages.
The pipeline schedule coordinates the pipeline’s stages.
Distributed checkpointing is used to save and restore the model weights and optimizer state in a distributed environment, since you no longer have a single process with access to the full model.
There are limitations in the current PyTorch pipeline-parallelism API. Your model may require modifications to support pipeline parallelism.