Author Archive | Adrian Tam

Building a Decoder-Only Transformer Model Like Llama-2 and Llama-3

Today's large language models are a simplified form of the transformer model. They are called decoder-only models because their role is similar to the decoder part of the transformer, which generates an output sequence given a partial sequence as input. Architecturally, they are closer to the encoder part of the transformer model. In this […]

Building a Transformer Model for Language Translation

The Transformer architecture, introduced in 2017, revolutionized sequence-to-sequence tasks like language translation by eliminating the need for recurrent neural networks. Instead, it relies on self-attention mechanisms to process input sequences. In this post, you’ll learn how to build a Transformer model from scratch. In particular, you will understand: How self-attention processes input sequences How transformer […]

Building a Seq2Seq Model with Attention for Language Translation

The attention mechanism, introduced by Bahdanau et al. in 2014, significantly improved sequence-to-sequence (seq2seq) models. In this post, you’ll learn how to build and train a seq2seq model with attention for language translation, focusing on: Why attention mechanisms are essential How to implement attention in a seq2seq model Let’s get started. Overview This post is […]

Building a Plain Seq2Seq Model for Language Translation

Sequence-to-sequence (seq2seq) models are powerful architectures for tasks that transform one sequence into another, such as machine translation. These models employ an encoder-decoder architecture, where the encoder processes the input sequence and the decoder generates an output sequence based on the encoder’s output. The attention mechanism was developed for seq2seq models, and understanding how seq2seq […]

Skip Connections in Transformer Models

Transformer models consist of stacked transformer layers, each containing an attention sublayer and a feed-forward sublayer. These sublayers are not directly connected; instead, skip connections combine the input with the processed output in each sublayer. In this post, you will explore skip connections in transformer models. Specifically: Why skip connections are essential for training deep […]

Mixture of Experts Architecture in Transformer Models

Transformer models have proven highly effective for many NLP tasks. While scaling up with larger dimensions and more layers can increase their power, this also significantly increases computational complexity. Mixture of Experts (MoE) architecture offers an elegant solution by introducing sparsity, allowing models to scale efficiently without proportional computational cost increases. In this post, you […]

Linear Layers and Activation Functions in Transformer Models

Attention operations are the signature of transformer models, but they are not the only building blocks. Linear layers and activation functions are equally essential. In this post, you will learn about: Why linear layers and activation functions enable non-linear transformations The typical design of feed-forward networks in transformer models Common activation functions and their characteristics […]

LayerNorm and RMS Norm in Transformer Models

Normalization layers are crucial components in transformer models that help stabilize training. Without normalization, models often fail to converge or behave poorly. This post explores LayerNorm, RMS Norm, and their variations, explaining how they work and their implementations in modern language models. Let’s get started. Overview This post is divided into five parts; they are: […]

A Gentle Introduction to Attention Masking in Transformer Models

Attention mechanisms in transformer models need to handle various constraints that prevent the model from attending to certain positions. This post explores how attention masking enables these constraints and their implementations in modern language models. Let’s get started. Overview This post is divided into four parts; they are: Why Attention Masking is Needed Implementation of […]
