Archive | Inference from Transformer Models

Serving Multiple Users at Once: How Continuous Batching Keeps LLM Inference Efficient

In the previous article, we saw how a language model processes a prompt during prefill, generates tokens one at a time during decode, and uses a KV cache to avoid repeated computation. In the real world, inference servers handle hundreds or thousands of requests at the same time. How a server schedules those requests determines […]

From Prompt to Prediction: Understanding Prefill, Decode, and the KV Cache in LLMs

In the previous article, we saw how a language model converts logits into probabilities and samples the next token. But where do these logits come from? In this tutorial, we take a hands-on approach to understanding the generation pipeline:
How the prefill phase processes your entire prompt in a single parallel pass
How the decode […]

How LLMs Choose Their Words: A Practical Walk-Through of Logits, Softmax and Sampling

Large Language Models (LLMs) can produce varied, creative, and sometimes surprising outputs even when given the same prompt. This randomness is not a bug but a core feature of how the model samples its next token from a probability distribution. In this article, we break down the key sampling strategies and demonstrate how parameters such […]

Machine Learning Mastery is part of Guiding Tech Media, a leading digital media publisher focused on helping people figure out technology. Visit our corporate website to learn more about our mission and team.