Language models need to understand relationships between words in a sequence, regardless of their distance. This post explores how attention mechanisms enable this capability and their various implementations in modern language models. Let's get started.

Overview

This post is divided into three parts; they are:

- Why Attention is Needed
- The Attention Operation
- Multi-Head Attention (MHA)

[…]
