Creating a Llama or GPT Model for Next-Token Prediction

Natural language generation (NLG) is challenging because human language is complex and unpredictable. A naive approach of generating words randomly one by one would not be meaningful to humans. Modern decoder-only transformer models have proven effective for NLG tasks when trained on large amounts of text data. These models can be huge, but their structure is relatively simple. In this article, you will learn how to create a Llama or GPT model for next-token prediction.

Let’s get started.

Photo by Roman Kraft. Some rights reserved.

Overview

This article is divided into three parts; they are:

  • Understanding the Architecture of the Llama or GPT Model
  • Creating a Llama Model for Pretraining
  • Variations in the Architecture

Understanding the Architecture of the Llama or GPT Model

The architecture of a Llama or GPT model is simply a stack of transformer blocks. Each transformer block consists of a self-attention sub-layer and a feed-forward sub-layer, with normalization and residual connections applied around each sub-layer. That’s all the model has.

The architecture of GPT-2 and Llama models is shown in the figure below.

The GPT-2 (left) and Llama (right) model architectures

GPT-2 is older and mostly follows the decoder design of the original encoder-decoder transformer. It uses layer normalization and multi-headed attention, along with two linear layers in the feed-forward sub-layer. GPT-2 differs from the original transformer design in:

  • Using pre-norm instead of post-norm
  • Using learned position embeddings instead of sinusoidal embeddings
  • Using GELU as the activation function in the feed-forward sub-layer instead of ReLU

Since GPT-2 is a decoder-only model, no encoder input is available, so the cross-attention sub-layer from the original transformer’s decoder stack is removed. However, you should not confuse its architecture with a transformer encoder because a causal mask is applied to the self-attention sublayer, forcing the model to attend only to the left context.

The Llama model revised several aspects of GPT-2, and versions 1 through 3 of Llama largely share the same architecture. Like GPT-2, the Llama model is a decoder-only model that uses pre-norm. However, unlike GPT-2, the Llama model:

  • Uses grouped query attention (GQA) instead of multi-headed attention (MHA)
  • Uses rotary position embeddings (RoPE) instead of learned position embeddings or sinusoidal position embeddings, and this is applied at each self-attention sub-layer.
  • Uses SwiGLU as the activation function in the feed-forward sub-layer instead of ReLU. As a result, the feed-forward sublayer has three linear layers and an element-wise multiplication.
  • Uses RMSNorm as the normalization function instead of LayerNorm

After the entire transformer stack, one more normalization layer is applied to the hidden state. The base model refers to the neural network from the input embedding through this final normalization layer.

For pretraining, you want the model to learn to predict the next token in a sequence. To do this, you add a pretraining head to the model—a linear layer that projects the hidden state to the vocabulary size. The softmax function is then used to get the probability distribution over the vocabulary.

Creating a Llama Model for Pretraining

There are many implementations of the Llama model available online. You can use existing code or create one using the Hugging Face transformers library if you care more about training results than understanding the model architecture. For example, the model created below is a Llama model with 12 layers and a pretraining head:
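The following sketch uses the Hugging Face transformers library to build such a model. Only the 12 layers are fixed by the description above; the remaining hyperparameters are illustrative choices, not prescribed values:

```python
from transformers import LlamaConfig, LlamaForCausalLM

# Illustrative configuration: only the 12 layers are fixed by the text above
config = LlamaConfig(
    vocab_size=32000,
    hidden_size=256,
    intermediate_size=688,
    num_hidden_layers=12,
    num_attention_heads=8,
    num_key_value_heads=2,   # fewer KV heads than query heads enables GQA
    max_position_embeddings=512,
)
# LlamaForCausalLM is the base model plus the pretraining (language modeling) head
model = LlamaForCausalLM(config)
```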

However, creating a Llama model from scratch using PyTorch is not difficult. First, let’s define a data class to hold the model configuration:
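A possible configuration class is sketched below; the field names and default values are illustrative and can be adjusted freely:

```python
from dataclasses import dataclass

@dataclass
class LlamaConfig:
    """Model configuration; the default values below are illustrative."""
    vocab_size: int = 32000              # must match the tokenizer
    max_position_embeddings: int = 512   # longest sequence RoPE supports
    hidden_size: int = 256               # token embedding / model dimension
    intermediate_size: int = 688         # feed-forward expansion size
    num_hidden_layers: int = 12          # number of transformer blocks
    num_attention_heads: int = 8         # query heads
    num_key_value_heads: int = 2         # KV heads for GQA
    rms_norm_eps: float = 1e-5
```

Note that the number of query heads must be a multiple of the number of key-value heads for GQA to work.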

The model should have a fixed vocabulary size that matches the tokenizer you use. The maximum sequence length is a parameter for the rotary position embeddings and should be large enough to accommodate the longest sequence in your dataset. The hidden size, intermediate size, number of layers, and number of attention heads are hyperparameters that directly determine the model size and training or inference speed.

Rotary Position Embeddings

Let’s first implement the rotary position embeddings. Unlike other modules, rotary position embeddings have no learnable parameters. Instead, they precompute the cosine and sine matrices and save them as buffers. Right before attention is computed, RoPE is applied to the query and key matrices. Written as a formula, RoPE is applied as:

$$
\begin{aligned}
Q_{s,i} &= Q_{s,i} \cos(s\theta_i) - Q_{s,\frac{d}{2}+i} \sin(s\theta_i) \\
Q_{s,\frac{d}{2}+i} &= Q_{s,i} \sin(s\theta_i) + Q_{s,\frac{d}{2}+i} \cos(s\theta_i)
\end{aligned}
$$

where $Q_{s,i}$ is the $i$-th element of the token embedding of the query matrix $Q$ at sequence position $s$. The length $d$ of the vector being rotated is the per-head dimension in practice, since RoPE is applied separately to each attention head. Applying RoPE to the key matrix $K$ is similar. The frequency term $\theta_i$ is computed as:

$$
\theta_i = \frac{1}{10000^{2i/d}}
$$

To implement RoPE, you can use the following:
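One possible implementation is sketched below (the class name is an illustrative choice). The cosine and sine tables are precomputed in the constructor and stored as non-persistent buffers, and the forward pass rotates each pair $(i, d/2+i)$ of the last dimension:

```python
import torch
import torch.nn as nn

class RotaryPositionEmbedding(nn.Module):
    """Precomputed cos/sin tables; no learnable parameters."""
    def __init__(self, head_dim, max_seq_len, base=10000.0):
        super().__init__()
        # theta_i = 1 / base^(2i/d) for i = 0 .. d/2 - 1
        inv_freq = 1.0 / base ** (torch.arange(0, head_dim, 2).float() / head_dim)
        positions = torch.arange(max_seq_len).float()
        freqs = torch.outer(positions, inv_freq)       # (max_seq_len, head_dim/2)
        self.register_buffer("cos", freqs.cos(), persistent=False)
        self.register_buffer("sin", freqs.sin(), persistent=False)

    def forward(self, x):
        # x: (batch, num_heads, seq_len, head_dim); pairs (i, d/2 + i) are rotated
        seq_len = x.shape[2]
        cos, sin = self.cos[:seq_len], self.sin[:seq_len]
        x1, x2 = x.chunk(2, dim=-1)
        return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```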

RoPE is a function that transforms the input tensor; it adds no learnable parameters. To test the RoPE module above, you can do the following:
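As a self-contained sanity check, you can also verify the rotation formula directly: RoPE is a rotation of each $(i, d/2+i)$ pair, so it must preserve the norm of every token embedding and leave position 0 unchanged (where $\cos = 1$ and $\sin = 0$):

```python
import torch

# Build the cos/sin tables directly from the formula
head_dim, seq_len = 8, 16
inv_freq = 1.0 / 10000.0 ** (torch.arange(0, head_dim, 2).float() / head_dim)
freqs = torch.outer(torch.arange(seq_len).float(), inv_freq)
cos, sin = freqs.cos(), freqs.sin()

# Apply the rotation to a random "query" tensor of shape (batch, heads, seq, head_dim)
x = torch.randn(1, 2, seq_len, head_dim)
x1, x2 = x.chunk(2, dim=-1)
rotated = torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```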

Self-Attention Sub-Layer

The self-attention sub-layer is the core of the transformer model. It is responsible for attending to the context of the input sequence. In the Llama model, grouped query attention (GQA) is used, in which the key and value matrices are projected to fewer heads than the query matrix.

You do not need to implement GQA from scratch, as PyTorch provides a built-in GQA implementation in the scaled_dot_product_attention function. You just need to pass the enable_gqa=True argument to the function.

Below is how you can implement the self-attention sub-layer:

This is only a few lines of code, but it is not trivial. The attention block implements GQA self-attention with RoPE.

Upon entry, you project the input tensor to query, key, and value tensors using linear layers. Then split the last dimension (corresponding to the token embeddings) into two parts: one for the number of heads and one for the per-head dimension. Note that the original tensor has shape (batch_size, sequence_length, hidden_size), and after projection and reshaping, it has shape (batch_size, sequence_length, num_heads, head_dim). You should transpose this to (batch_size, num_heads, sequence_length, head_dim) before applying attention.

Computing GQA is straightforward. You simply call the scaled_dot_product_attention function with enable_gqa=True and pass the query, key, and value tensors. Note that the query tensor has more heads than the key and value tensors, but the number of query heads must be a multiple of the number of key and value heads.

For the output, you restore the tensor to its original shape and project it again with a linear layer.

SwiGLU Sub-Layer

The feed-forward sub-layer in the Llama model uses SwiGLU as the activation function. Mathematically, SwiGLU is defined as:

$$
\begin{aligned}
\text{SwiGLU}(x) &= \text{swish}(xW_1 + b_1) \otimes (xW_2 + b_2) \\
\text{swish}(x) &= \frac{x}{1 + e^{-x}}
\end{aligned}
$$

The symbol $\otimes$ represents element-wise multiplication of the two tensors. Because it multiplies two linear projections of the input, SwiGLU can effectively model quadratic interactions between input features.

In GPT and Llama models, you expand the input tensor dimension to a larger size, apply the activation function, and then project it back to the original dimension. This dimensional expansion is believed to enable learning more complex relationships between input and output.

PyTorch does not provide a single function for implementing SwiGLU. But you can implement it using the F.silu function and the element-wise multiplication, as follows:
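A sketch of the feed-forward sub-layer is below, following the bias-free convention used throughout the model (the class and attribute names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Feed-forward sub-layer: gated up-projection, then down-projection."""
    def __init__(self, hidden_size, intermediate_size):
        super().__init__()
        # three linear layers, bias-free as in Llama
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x):
        # F.silu is the swish function: x * sigmoid(x)
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))
```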

Transformer Block

A GPT or Llama model is a stack of transformer blocks. To implement a transformer block, you process the input tensor through the self-attention sub-layer and the feed-forward sub-layer. The implementation is as follows:

The attention and feed-forward sub-layers are the main components of the transformer block. You apply normalization before each sub-layer and add the residual connection afterward.

The forward() method accepts an input tensor, a RoPE module, and an attention mask tensor. The RoPE and attention mask are shared by all transformer blocks and should be created once, outside the blocks.

The Base Llama Model and the Pretraining Head

The base Llama model ties all the transformer blocks together. It creates the rotary position embeddings module to share across all transformer blocks.

With all the building blocks in place, the base Llama model is straightforward to implement:

The base model expects an integer tensor of token IDs as input. This is transformed into a floating-point tensor of embedding vectors using the embedding layer. These are the “hidden states” transformed by each block. The output from the last block is normalized again to produce the model’s final output.

Note that a rotary position embedding module is created in the constructor and passed to each transformer block in the forward() method. This allows efficient sharing of RoPE calculations across all blocks.

As a final step, you would need a pretraining model, as follows:
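A sketch of the pretraining wrapper is below; a stub base model keeps the snippet self-contained, and the class names are illustrative:

```python
import torch
import torch.nn as nn

class LlamaForPretraining(nn.Module):
    """Base model plus a linear pretraining head producing next-token logits."""
    def __init__(self, base_model, hidden_size, vocab_size):
        super().__init__()
        self.model = base_model
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, input_ids, attn_mask=None):
        hidden = self.model(input_ids, attn_mask)
        return self.lm_head(hidden)   # logits; softmax is applied outside the model

class _StubBase(nn.Module):
    """Stand-in base model so this snippet runs on its own."""
    def __init__(self, vocab, d):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
    def forward(self, ids, mask=None):
        return self.embed(ids)

lm = LlamaForPretraining(_StubBase(100, 32), hidden_size=32, vocab_size=100)
```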

This is simply an additional linear layer applied to the base model’s output. This linear layer is large because it projects the hidden state to the vocabulary size. The output of this layer is the logits for the next token in the vocabulary. You should not include a softmax layer in the model for two reasons:

1. In PyTorch, the cross-entropy loss function expects the logits, not the probabilities.
2. During inference, you may want to compute the softmax probabilities with a custom temperature. It is better to leave it outside the model to retain the flexibility.

That’s all you need to create a Llama model for pretraining!

Attention Masks

From the code above, you can see that the Llama model expects an integer tensor of token IDs and an attention mask tensor as input. The token ID tensor has shape (batch_size, sequence_length), and the attention mask tensor has shape (batch_size, 1, sequence_length, sequence_length). This is because the attention function produces attention weights of shape (batch_size, num_heads, sequence_length, sequence_length), and the mask must be broadcastable to this shape.

You could modify the model to accept only the token ID tensor and automatically generate a causal attention mask. However, for training, it’s better to control the attention mask explicitly so you can mask out padding tokens or other unwanted tokens in the batch.

Generating a mask that handles both causal masking and padding is straightforward. Below is code to generate a random input tensor with some padding tokens and then create the corresponding causal and padding masks:
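One way to build such masks is sketched below; the pad token ID and tensor sizes are arbitrary illustrative values:

```python
import torch

batch_size, seq_len, pad_id = 2, 8, 0
# Random token IDs, with trailing padding in the second sequence
ids = torch.randint(1, 100, (batch_size, seq_len))
ids[1, 5:] = pad_id

# Causal mask: True where attention is allowed (lower triangle)
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
# Padding mask: True where the key position is a real token
not_pad = (ids != pad_id)[:, None, None, :]           # (b, 1, 1, s)
allowed = causal[None, None, :, :] & not_pad          # (b, 1, s, s)

# Convert to an additive float mask: 0 where allowed, -inf where masked
mask = torch.zeros(batch_size, 1, seq_len, seq_len)
mask.masked_fill_(~allowed, float("-inf"))
```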

A sequence of length $N$ (excluding padding tokens) should have a triangular mask matrix of shape $N\times N$ with all above-diagonal elements (and all elements outside this $N\times N$ range) set to $-\infty$.

To test the model you built, you can run it with a random input to ensure no errors are raised:

You should see that this model is really small, only 179.45M parameters. But it is good enough as a toy project to learn how to implement a language model from scratch.

Below is the complete code:

Variations in the Architecture

The code above implements the base Llama model. It is straightforward to convert the implementation to a GPT model by removing RoPE, replacing it with a learned positional embeddings layer in the base model, replacing RMSNorm with LayerNorm, disabling GQA, and using a GELU feed-forward sub-layer instead of SwiGLU.

If you want to design your own decoder-only transformer model, you are not required to follow the exact architecture of Llama or GPT. Besides basic hyperparameters such as vocabulary size or number of transformer layers, here are some variations you may consider:

  • In the implementation above, the nn.Linear layers deliberately use bias=False to skip the bias term. This slightly reduces model size and accelerates training. You may want to add the bias term to make the model more flexible in learning complex patterns.
  • The feed-forward sublayer consists of an upward projection, a SwiGLU activation, and a downward projection. This is the “two-layer MLP” pattern. You could explore a three-layer MLP by adding an additional projection and activation layer, which can model more complex patterns at the cost of more parameters.
  • Newer transformer models (such as Llama 4) use a Mixture-of-Experts (MoE) to replace the feed-forward sublayer. This is more efficient to train than simply increasing the feed-forward sub-layer size and is useful for larger models that need to learn more complex patterns.
  • Dropout can be added to prevent overfitting. This is less common in large models, since they are usually underfit. But if you do, there are three places where dropout is commonly applied: (1) in the feed-forward sublayer, right after the nonlinear activation; (2) in the attention sublayer, applied to the computed attention scores (in the implementation above, the PyTorch function F.scaled_dot_product_attention() takes a dropout_p argument to enable this); (3) at the output of each sublayer, right before the residual connection.


Summary

In this article, you learned how to create a Llama model for next-token prediction from scratch using PyTorch. Specifically, you learned:

  • The building blocks of a decoder-only transformer model
  • The architectural differences between Llama and GPT
  • How to implement all the building blocks of a Llama model
  • What are the variations in the architecture that you may explore
