Author Archive | Adrian Tam

Building a Decoder-Only Transformer Model Like Llama-2 and Llama-3

By Adrian Tam on September 11, 2025 in Building Transformer Models

Today’s large language models are a simplified form of the transformer model. They are called decoder-only models because their role is similar to the decoder part of the transformer, which generates an output sequence given a partial sequence as input. Architecturally, they are closer to the encoder part of the transformer model. In this […]

Building a Transformer Model for Language Translation

By Adrian Tam on September 12, 2025 in Building Transformer Models

The Transformer architecture, introduced in 2017, revolutionized sequence-to-sequence tasks like language translation by eliminating the need for recurrent neural networks. Instead, it relies on self-attention mechanisms to process input sequences. In this post, you’ll learn how to build a Transformer model from scratch. In particular, you will understand: how self-attention processes input sequences; how transformer […]

Building a Seq2Seq Model with Attention for Language Translation

By Adrian Tam on September 12, 2025 in Building Transformer Models

The attention mechanism, introduced by Bahdanau et al. in 2014, significantly improved sequence-to-sequence (seq2seq) models. In this post, you’ll learn how to build and train a seq2seq model with attention for language translation, focusing on: why attention mechanisms are essential; how to implement attention in a seq2seq model. Let’s get started. Overview: This post is […]

Building a Plain Seq2Seq Model for Language Translation

By Adrian Tam on September 12, 2025 in Building Transformer Models

Sequence-to-sequence (seq2seq) models are powerful architectures for tasks that transform one sequence into another, such as machine translation. These models employ an encoder-decoder architecture, where the encoder processes the input sequence and the decoder generates an output sequence based on the encoder’s output. The attention mechanism was developed for seq2seq models, and understanding how seq2seq […]

Skip Connections in Transformer Models

By Adrian Tam on September 12, 2025 in Building Transformer Models

Transformer models consist of stacked transformer layers, each containing an attention sublayer and a feed-forward sublayer. These sublayers are not directly connected; instead, skip connections combine the input with the processed output in each sublayer. In this post, you will explore skip connections in transformer models. Specifically: Why skip connections are essential for training deep […]

Mixture of Experts Architecture in Transformer Models

By Adrian Tam on November 28, 2025 in Building Transformer Models

Transformer models have proven highly effective for many NLP tasks. While scaling up with larger dimensions and more layers can increase their power, this also significantly increases computational complexity. Mixture of Experts (MoE) architecture offers an elegant solution by introducing sparsity, allowing models to scale efficiently without proportional computational cost increases. In this post, you […]

Linear Layers and Activation Functions in Transformer Models

By Adrian Tam on September 12, 2025 in Building Transformer Models

Attention operations are the signature of transformer models, but they are not the only building blocks. Linear layers and activation functions are equally essential. In this post, you will learn about: why linear layers and activation functions enable non-linear transformations; the typical design of feed-forward networks in transformer models; common activation functions and their characteristics […]

LayerNorm and RMS Norm in Transformer Models

By Adrian Tam on September 12, 2025 in Building Transformer Models

Normalization layers are crucial components in transformer models that help stabilize training. Without normalization, models often fail to converge or behave poorly. This post explores LayerNorm, RMS Norm, and their variations, explaining how they work and their implementations in modern language models. Let’s get started. Overview: This post is divided into five parts; they are: […]

A Gentle Introduction to Attention Masking in Transformer Models

By Adrian Tam on January 18, 2026 in Building Transformer Models

Attention mechanisms in transformer models need to handle various constraints that prevent the model from attending to certain positions. This post explores how attention masking enables these constraints and their implementations in modern language models. Let’s get started. Overview: This post is divided into four parts; they are: Why Attention Masking is Needed; Implementation of […]

A Gentle Introduction to Multi-Head Latent Attention (MLA)

By Adrian Tam on January 18, 2026 in Building Transformer Models

Not all Transformer models are called “large language models” because you can build a very small model using the Transformer architecture. The truly large Transformer models are often impractical to use at home because they’re too large to fit on a single computer and too slow to run without a cluster of GPUs. The recent […]
