From Prompt to Prediction: Understanding Prefill, Decode, and the KV Cache in LLMs

In the previous article, we saw how a language model converts logits into probabilities and samples the next token. But where do these logits come from?

In this tutorial, we take a hands-on approach to understand the generation pipeline:

  • How the prefill phase processes your entire prompt in a single parallel pass
  • How the decode phase generates tokens one at a time using previously computed context
  • How the KV cache eliminates redundant computation to make decoding efficient

By the end, you will understand the two-phase mechanics behind LLM inference and why the KV cache is essential for generating long responses at scale.

Let’s get started.

Photo by Neda Astani. Some rights reserved.

Overview

This article is divided into three parts; they are:

  • How Attention Works During Prefill
  • The Decode Phase of LLM Inference
  • KV Cache: How to Make Decode More Efficient

How Attention Works During Prefill

Consider the prompt:

Today’s weather is so …

As humans, we can infer the next token should be an adjective, because the last word “so” is a setup. We also know it probably describes weather, so words like “nice” or “warm” are more likely than something unrelated like “delicious”.

Transformers arrive at the same conclusion through attention. During prefill, the model processes the entire prompt in a single forward pass. Every token attends to itself and all tokens before it, building up a contextual representation that captures relationships across the full sequence.

The mechanism behind this is the scaled dot-product attention formula:

$$
\text{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V
$$

We will walk through this concretely below.

To make the attention computation traceable, we assign each token a scalar value representing the information it carries:

| Position | Token   | Value |
|----------|---------|-------|
| 1        | Today   | 10    |
| 2        | weather | 20    |
| 3        | is      | 1     |
| 4        | so      | 5     |

Words like “is” and “so” carry less semantic weight than “Today” or “weather”, and as we’ll see, attention naturally reflects this.

Attention Heads

In real transformers, attention weights are continuous values learned during training through the $Q$ and $K$ dot product. The behavior of attention heads is learned and usually impossible to describe in simple terms; no head is hardwired to “attend to even positions”. The four rules below are a simplified illustration to make the attention mechanism more intuitive, while the weighted aggregation over $V$ works the same way.

Here are the rules in our toy example:

  1. Attend to tokens at even number positions
  2. Attend to the last token
  3. Attend to the first token
  4. Attend to every token

For simplicity in this example, the outputs from these heads are then combined (averaged).

Let’s walk through the prefill process:

Today

  1. Even tokens → none
  2. Last token → Today → 10
  3. First token → Today → 10
  4. All tokens → Today → 10

weather

  1. Even tokens → weather → 20
  2. Last token → weather → 20
  3. First token → Today → 10
  4. All tokens → average(Today, weather) → 15

is

  1. Even tokens → weather → 20
  2. Last token → is → 1
  3. First token → Today → 10
  4. All tokens → average(Today, weather, is) → 10.33

so

  1. Even tokens → average(weather, so) → 12.5
  2. Last token → so → 5
  3. First token → Today → 10
  4. All tokens → average(Today, weather, is, so) → 9

Parallelizing Attention

If the prompt contained 100,000 tokens, computing attention step-by-step would be extremely slow. Fortunately, attention can be expressed as tensor operations, allowing all positions to be computed in parallel.

This is the key idea of the prefill phase in LLM inference: a prompt contains multiple tokens, and they can all be processed in parallel. This parallelism shortens the time until the first token of the response is generated.

To prevent tokens from seeing future tokens, we apply a causal mask, so that each token can only attend to itself and earlier tokens.
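Here is a minimal numpy sketch of this idea, with arbitrary toy dimensions; all positions are computed in one batched pass, and a causal mask hides future positions:

```python
import numpy as np

def causal_attention(Q, K, V):
    """Scaled dot-product attention over the whole prompt in one pass."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)                   # all query-key pairs at once
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(mask, -1e9, scores)             # causal mask: hide the future
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # context for every position

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
ctx = causal_attention(Q, K, V)
print(ctx.shape)   # (4, 8)
```

Row 0 can only attend to itself, so its context equals its own value vector; later rows blend in more and more of the prompt.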


Now, we can start writing the “rules” for the 4 attention heads.

Rather than computing scores from learned $Q$ and $K$ vectors, we handcraft them directly to match our four attention rules. Each head produces a score matrix of shape (n, n), with one score per query-key pair, which gets masked and passed through softmax to produce attention weights:
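A numpy sketch of these handcrafted heads follows. The scores use 0 for “attend here” and a large negative number elsewhere, so softmax spreads equal weight over the attended set (i.e., it averages). One assumption not in the walkthrough: head 1 falls back to the token itself when no even position exists yet, which only affects “Today”:

```python
import numpy as np

V = np.array([10.0, 20.0, 1.0, 5.0])   # toy values: Today, weather, is, so
n = len(V)
NEG = -1e9                              # "do not attend" score

scores = np.full((4, n, n), NEG)        # one (n, n) score matrix per head
for q in range(n):
    evens = [k for k in range(q + 1) if (k + 1) % 2 == 0]
    if evens:
        scores[0, q, evens] = 0.0       # head 1: even positions so far
    else:
        scores[0, q, q] = 0.0           # assumed fallback: self (article says "none")
    scores[1, q, q] = 0.0               # head 2: the last (current) token
    scores[2, q, 0] = 0.0               # head 3: the first token
    scores[3, q, :q + 1] = 0.0          # head 4: every token so far

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

weights = softmax(scores)               # equal weight over each attended set
context = weights @ V                   # (heads, n): per-head context values
print(context[:, -1])                   # context for "so": 12.5, 5.0, 10.0, 9.0
```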


The result of this step is called a context vector, which represents a weighted summary of all previous tokens.

From Contexts to Logits

Each attention head has learned to pick up on different patterns in the input. Together, the four context values [12.5, 5.0, 10.0, 9.0] form a summary of what “Today’s weather is so…” represents. This vector is then projected through a matrix in which each column encodes how strongly a vocabulary word is associated with each attention head’s signal, giving a logit score per word.

For our example, let’s say we have “nice”, “warm”, and “delicious” in the vocab:
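As a sketch, the projection matrix below is made up for illustration (real weights are learned); each of its columns scores how strongly one vocabulary word responds to each head’s signal:

```python
import numpy as np

context = np.array([12.5, 5.0, 10.0, 9.0])   # four head outputs for "so"
vocab = ["nice", "warm", "delicious"]

# Hypothetical 4x3 projection: rows = heads, columns = vocabulary words.
W = np.array([
    [0.8, 0.7, 0.1],   # head 1 signal
    [0.9, 0.8, 0.1],   # head 2 signal
    [0.6, 0.7, 0.2],   # head 3 signal
    [0.7, 0.6, 0.1],   # head 4 signal
])
logits = context @ W                          # one logit per word
for word, logit in zip(vocab, logits):
    print(f"{word}: {logit:.2f}")
```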

So the logits for “nice” and “warm” are much higher than “delicious”.

The Decode Phase of LLM Inference

Now suppose the model generates the next token: “nice”. The task is now to generate the next token with the extended prompt:

Today’s weather is so nice …

The first four words of the extended prompt are the same as the original prompt; the fifth word is new.

During decode, we do not recompute attention for all previous tokens, as the result would be the same. Instead, we compute attention only for the new token, saving time and compute resources. This produces a single new attention row.


Now, we apply the 4 attention heads and compute the new context vector:
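A numpy sketch of this step, assuming the new token “nice” carries a made-up value of 8 (like the other numbers in our toy table): instead of an $(n, n)$ score matrix per head, each head now needs only a single score row for the new position:

```python
import numpy as np

V_prev = np.array([10.0, 20.0, 1.0, 5.0])  # cached values for the prompt
v_new = 8.0                                 # assumed value for "nice"
V = np.append(V_prev, v_new)
q = len(V) - 1                              # position of the new token
NEG = -1e9

rows = np.full((4, len(V)), NEG)            # one score ROW per head, not a matrix
rows[0, [k for k in range(q + 1) if (k + 1) % 2 == 0]] = 0.0  # even positions
rows[1, q] = 0.0                            # last token ("nice")
rows[2, 0] = 0.0                            # first token ("Today")
rows[3, :q + 1] = 0.0                       # every token so far

weights = np.exp(rows - rows.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
context = weights @ V                       # new context vector for "nice"
print(context)                              # approx. [12.5, 8.0, 10.0, 8.8]
```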


However, unlike prefill where the entire prompt is processed in parallel, decoding must generate tokens one at a time (autoregressively) because the future tokens have not yet been generated. Without caching, every decode step would recompute keys and values for all previous tokens from scratch, making the total work across all decode steps $O(n^2)$ in sequence length. KV cache reduces this to $O(n)$ by computing each token’s $K$ and $V$ exactly once.

KV Cache: How to Make Decode More Efficient

To make autoregressive decoding efficient, we can store the keys ($K$) and values ($V$) for every token, separately for each attention head. In this simplified example, we use only one cache. During decoding, when a new token is generated, the model does not recompute keys and values for all previous tokens: it computes the query for the new token and attends to the cached keys and values.

During decode, there is no need to recompute $K$ for the entire tensor.

Instead, we can simply compute K for the new position, and attach it to the K matrix we have already computed and saved in cache:
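A small numpy sketch of the difference, with a hypothetical key projection `W_k` and random embeddings: appending one freshly computed row to the cached $K$ gives exactly the same matrix as recomputing everything:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8
W_k = rng.normal(size=(d_model, d_model))  # hypothetical learned key projection
X = rng.normal(size=(5, d_model))          # embeddings: 4 prompt tokens + "nice"

# Without a cache: every decode step recomputes keys for ALL tokens
K_recomputed = X @ W_k

# With a cache: prefill already produced the first 4 rows; decode adds one
K_cache = X[:4] @ W_k                      # saved during prefill
k_new = X[4:] @ W_k                        # only the new token's key
K = np.vstack([K_cache, k_new])

print(np.allclose(K, K_recomputed))        # True -- same result, far less work
```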

Here’s the full code for decode phase using KV cache:
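Here is a self-contained single-head sketch with random toy tensors; `decode_step` is a hypothetical helper implementing the append-then-attend pattern described above:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def decode_step(q_new, k_new, v_new, K_cache, V_cache):
    """One decode step: append the new token's K and V, attend with its query only."""
    K_cache = np.vstack([K_cache, k_new])      # cache grows by one row
    V_cache = np.vstack([V_cache, v_new])
    d_k = K_cache.shape[1]
    scores = q_new @ K_cache.T / np.sqrt(d_k)  # (1, t): a single attention row
    context = softmax(scores) @ V_cache        # (1, d): context for the new token
    return context, K_cache, V_cache

rng = np.random.default_rng(0)
d = 8
# Prefill: K and V for the 4 prompt tokens go straight into the cache
K_cache = rng.normal(size=(4, d))
V_cache = rng.normal(size=(4, d))
# Decode: only the new token's Q, K, V are computed
q_new, k_new, v_new = (rng.normal(size=(1, d)) for _ in range(3))
context, K_cache, V_cache = decode_step(q_new, k_new, v_new, K_cache, V_cache)
print(context.shape, K_cache.shape)   # (1, 8) (5, 8)
```

The cache grows by one row per generated token; the attention math itself is unchanged, only the redundant recomputation of $K$ and $V$ is gone.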


Notice this is identical to the result we computed without the cache. KV cache doesn’t change what the model computes, but it eliminates redundant computations.

The KV cache differs from caches in other applications in that the stored object is not replaced but extended: every new token added to the prompt appends a new row to the stored tensor. Implementing a KV cache that can efficiently grow the tensor is key to making LLM inference fast.


Summary

In this article, we walked through the two phases of LLM inference. During prefill, the full prompt is processed in one parallel forward pass, and the keys and values for every token are computed and stored. During decode, the model generates one token at a time, using only the new token’s query against the cached keys and values to avoid redundant recomputation. Prefill warms up the KV cache, and decode updates it. Faster prefill means you see the first token of the response sooner; faster decode means you see the rest of the response sooner. Together, these two phases explain why LLMs can process long prompts quickly yet generate output token by token, and why the KV cache is essential for making that generation practical at scale.

