Archive | Inference from Transformer Models

Serving Multiple Users at Once: How Continuous Batching Keeps LLM Inference Efficient

By Yoyo Chan on May 31, 2026 in Inference from Transformer Models

In the previous article, we saw how a language model processes a prompt during prefill, generates tokens one at a time during decode, and uses the KV cache to avoid repeated computation. In the real world, inference servers handle hundreds or thousands of requests at the same time. How a server schedules those requests determines […]

From Prompt to Prediction: Understanding Prefill, Decode, and the KV Cache in LLMs

By Yoyo Chan on April 1, 2026 in Inference from Transformer Models

In the previous article, we saw how a language model converts logits into probabilities and samples the next token. But where do these logits come from? In this tutorial, we take a hands-on approach to understanding the generation pipeline: how the prefill phase processes your entire prompt in a single parallel pass; how the decode […]

How LLMs Choose Their Words: A Practical Walk-Through of Logits, Softmax and Sampling

By Yoyo Chan on December 14, 2025 in Inference from Transformer Models

Large Language Models (LLMs) can produce varied, creative, and sometimes surprising outputs even when given the same prompt. This randomness is not a bug but a core feature of how the model samples its next token from a probability distribution. In this article, we break down the key sampling strategies and demonstrate how parameters such […]
