Text Generation — Lesson Content
Learn how LLMs generate text one token at a time — from probability distributions to decoding strategies to knowing when to stop.
Text generation is where LLMs come alive — transforming a prompt into coherent, useful text. But how exactly does this work? The model computes a probability distribution over its entire vocabulary at each step, selects one token, and feeds it back as input.
This lesson walks through the complete generation pipeline: computing next-token probabilities with softmax, choosing tokens with greedy decoding (and its problems), confronting the vocabulary-size challenge, understanding the autoregressive generation loop with KV-caching, and implementing proper stopping conditions.
Learning Objectives
- Trace the pipeline from input tokens to next-token probability distribution
- Explain how softmax converts logits to probabilities
- Describe greedy decoding and its limitations
- Explain the vocabulary projection and its computational cost
- Describe the autoregressive generation loop and why KV-caching is essential
- List the main stopping conditions and when each is appropriate
Step 1: Next Token Prediction
Text generation in LLMs is built on a single operation: **predict the next token**.
Given a sequence of tokens (the "context" or "prompt"), the model produces a probability distribution over the entire vocabulary — a number for every possible next token indicating how likely it is.
**The pipeline for each prediction:**
1. **Tokenize** the input text into token IDs: "The cat sat on the" → [464, 3797, 3332, 319, 262]
2. **Embed** each token ID into a vector (lookup in the embedding table)
3. **Process** through all transformer layers (self-attention + feed-forward, repeated N times)
4. **Project** the final hidden state to vocabulary size: a vector of ~50,000 numbers (logits)
5. **Apply softmax** to convert logits to probabilities that sum to 1.0
6. **Select** the next token from this distribution
This final selection step is where **decoding strategies** come in — greedy, sampling, beam search, etc. The choice of strategy dramatically affects the quality and diversity of generated text.
The model doesn't "think" about what to say — it computes a mathematical function that maps input tokens to a probability distribution over next tokens. Yet this simple process produces remarkably coherent text.
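The six steps above can be sketched end to end with NumPy. This is a toy with random weights, not a real model: the vocabulary size, the hidden size, and the `toy_transformer` stand-in (a plain average of the token vectors) are assumptions made for illustration, and only the token IDs come from the example in step 1.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE, HIDDEN = 5000, 16   # toy sizes (assumed), far smaller than a real LLM

embedding = rng.normal(size=(VOCAB_SIZE, HIDDEN))   # step 2: embedding table
w_output = rng.normal(size=(HIDDEN, VOCAB_SIZE))    # step 4: output projection

def toy_transformer(vectors):
    """Stand-in for step 3. A real model runs attention and feed-forward
    layers; here we only average the token vectors into one hidden state."""
    return vectors.mean(axis=0)

def next_token_distribution(token_ids):
    vectors = embedding[token_ids]          # step 2: embed
    hidden = toy_transformer(vectors)       # step 3: process
    logits = hidden @ w_output              # step 4: project to vocabulary size
    exp = np.exp(logits - logits.max())     # step 5: softmax (shifted for stability)
    return exp / exp.sum()

token_ids = [464, 3797, 3332, 319, 262]     # step 1: tokenizer output from the example
probs = next_token_distribution(token_ids)
next_token = int(np.argmax(probs))          # step 6: select (greedy here)
print(next_token, float(probs[next_token]))
```

Because the weights are random, the selected token is meaningless; the point is only the shape of the computation, which ends in one probability per vocabulary entry.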
Step 2: The Probability Distribution
After processing the input through the transformer, the model produces **logits** — raw scores for each token in the vocabulary. These are then converted to probabilities via the **softmax** function.
For our small 10-word vocabulary, given the context "The cat sat on the ___":
**Raw logits** (model output): one number per vocabulary word, before normalization. Higher logit = model thinks this word is more likely. Logits can be any real number — positive, negative, or zero.
**Softmax** converts logits to a valid probability distribution:
**P(word_i) = exp(logit_i) / sum_j(exp(logit_j))**
Properties of softmax:
- All probabilities are between 0 and 1
- They sum to exactly 1.0
- Higher logits get exponentially higher probabilities
- Only the differences between logits matter: adding the same constant to every logit leaves the probabilities unchanged
In our example, "mat" has the highest logit (3.8) and gets probability 0.741. The model is quite confident that "mat" comes next. But notice that "the" and "." also have non-trivial probabilities — the model considers multiple plausible continuations.
In a real LLM, this distribution is over ~50,000-100,000 tokens, and most get near-zero probability.
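A minimal softmax sketch may help make the formula concrete. The four-word vocabulary and its logits are hypothetical values chosen for illustration, not the lesson's example numbers. Subtracting the maximum logit before exponentiating is a common numerical-stability trick and does not change the result.

```python
import numpy as np

def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    logits = np.asarray(logits, dtype=np.float64)
    shifted = logits - logits.max()   # avoids overflow in exp; result is unchanged
    exp = np.exp(shifted)
    return exp / exp.sum()

# Hypothetical logits for a 4-word vocabulary (not the lesson's values).
vocab = ["mat", "floor", "the", "."]
logits = np.array([3.0, 1.0, 0.5, -1.0])

probs = softmax(logits)
shifted_probs = softmax(logits + 100.0)   # same differences, larger absolute values

for word, p in zip(vocab, probs):
    print(f"{word:>6}: {p:.3f}")
print("unchanged by a constant shift:", np.allclose(probs, shifted_probs))
```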
Step 3: Greedy Decoding
**Greedy decoding** is the simplest selection strategy: always pick the token with the highest probability.
At each step:
1. Compute the probability distribution over the vocabulary
2. Select argmax — the token with the highest probability
3. Append it to the sequence
4. Repeat
For "The cat sat on the ___": the model assigns the highest probability to "mat" (0.741), so greedy decoding selects "mat."
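The greedy loop can be sketched with a toy "model" that looks only at the last token. The `NEXT_LOGITS` table is entirely hypothetical; a real LLM conditions on the whole context and would, as in the example above, prefer "mat" after "The cat sat on the". Here the table is deliberately rigged so the argmax path never reaches "mat".

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat", "."]

# Hypothetical next-token logits: row = current token, column = candidate next token.
NEXT_LOGITS = np.array([
    # the  cat  sat   on  mat    .
    [0.0, 2.0, 0.0, 0.0, 1.5, 0.0],   # after "the"
    [0.0, 0.0, 3.0, 0.0, 0.0, 0.5],   # after "cat"
    [0.0, 0.0, 0.0, 3.0, 0.0, 0.5],   # after "sat"
    [3.0, 0.0, 0.0, 0.0, 0.0, 0.0],   # after "on"
    [0.0, 0.0, 0.0, 0.0, 0.0, 3.0],   # after "mat"
    [2.0, 0.0, 0.0, 0.0, 0.0, 0.0],   # after "."
])

def greedy_decode(start, steps):
    tokens = [vocab.index(start)]
    for _ in range(steps):
        logits = NEXT_LOGITS[tokens[-1]]
        # Softmax preserves ordering, so argmax over logits equals argmax over probabilities.
        tokens.append(int(np.argmax(logits)))
    return [vocab[t] for t in tokens]

print(" ".join(greedy_decode("the", 8)))
```

Running it twice gives the same sequence, which illustrates the determinism of greedy decoding. In this toy table the output also cycles through "the cat sat on" indefinitely, because the slightly less likely "mat" is never chosen.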
**Advantages:**
- Deterministic — same input always gives same output
- Fast — no sampling overhead
- Often produces grammatically correct, factually reasonable text
**Disadvantages:**
- **Repetitive:** Greedy decoding tends to get stuck in loops ("The cat sat on the mat. The cat sat on the mat. The cat...")
- **Boring:** It always chooses the "safest" continuation, missing creative or diverse options
- **Locally optimal, globally poor:** Picking the best word at each step doesn't guarantee the best overall sentence. "I want to" → greedy picks "go" but the best continuation might start with "make"
This is why more sophisticated strategies (temperature sampling, top-k, top-p) are used in practice. Greedy decoding is fine for factual tasks but poor for creative generation.
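As a rough illustration of one such strategy, the sketch below samples from `softmax(logits / temperature)`. The logits are hypothetical, and the helper name `sample_with_temperature` is made up for this example. Low temperatures concentrate probability on the top token (approaching greedy decoding), while higher temperatures spread it across more candidates.

```python
import numpy as np

def sample_with_temperature(logits, temperature, rng):
    """Sample a token index from softmax(logits / temperature)."""
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    exp = np.exp(scaled - scaled.max())
    probs = exp / exp.sum()
    return int(rng.choice(len(probs), p=probs))

# Hypothetical logits for a 4-token vocabulary.
logits = [3.0, 2.0, 1.0, 0.0]
rng = np.random.default_rng(0)

for temperature in (0.1, 1.0, 2.0):
    draws = [sample_with_temperature(logits, temperature, rng) for _ in range(1000)]
    counts = np.bincount(draws, minlength=len(logits))
    print(f"T={temperature}: counts per token = {counts.tolist()}")
```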
Step 4: The Vocabulary Problem
A real LLM doesn't choose from 10 words — it chooses from its entire vocabulary of **50,000-100,000 tokens** at every single step.
This creates the **vocabulary problem**: the final layer of the model must project a hidden state vector (e.g., 4,096 dimensions) to a vector of 50,000+ logits — one for each possible next token.
**The final projection:** logits = hidden_state @ W_output + b
- hidden_state: [4,096] — the contextualized representation of the sequence
- W_output: [4,096 x 50,000] — the "unembedding" or output projection matrix
- logits: [50,000] — raw scores for every token in the vocabulary
This single matrix multiplication involves about **205 million** multiply-add operations (4,096 x 50,000), and it happens at every generation step.
**Memory implications:**
- W_output at FP16 = 50,000 x 4,096 x 2 bytes ≈ **410 MB** just for the output layer
- Softmax over 50,000 entries at every step
- Often the embedding matrix and output projection matrix are **tied** (shared) to save parameters
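The arithmetic behind these figures is easy to check. The sizes below are the example values used in this step (4,096 hidden dimensions, 50,000 tokens, 2 bytes per FP16 weight); the bias term is ignored because it is negligible next to the matrix.

```python
hidden_size = 4096
vocab_size = 50_000
bytes_per_param_fp16 = 2

# One multiply-add per weight in W_output for each generated token.
multiply_adds_per_step = hidden_size * vocab_size
weight_bytes = multiply_adds_per_step * bytes_per_param_fp16

print(f"multiply-adds per step: {multiply_adds_per_step:,}")
print(f"W_output at FP16: {weight_bytes / 1e6:.1f} MB ({weight_bytes / 2**20:.1f} MiB)")
```

This works out to 204.8 million multiply-adds and 409.6 MB, close to the rounded figures quoted above.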
**In practice**, most of the 50,000 probabilities are negligibly small. Only a handful of tokens have meaningful probability. This is why top-k and top-p sampling work — they ignore the vast "tail" of near-zero probabilities and only consider the plausible candidates.
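One way to picture how top-k and top-p discard the tail is the following sketch. It assumes the simplest form of each filter (zero out the rejected tokens, then renormalize); the function names and the example distribution are hypothetical, and production libraries typically apply the same idea to logits rather than probabilities.

```python
import numpy as np

def top_k_filter(probs, k):
    """Keep the k most probable tokens, zero the rest, renormalize."""
    probs = np.asarray(probs, dtype=np.float64)
    keep = np.argsort(probs)[-k:]             # indices of the k largest probabilities
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    probs = np.asarray(probs, dtype=np.float64)
    order = np.argsort(probs)[::-1]           # most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1   # number of tokens to keep
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

# Hypothetical distribution: a few plausible tokens and a tail of unlikely ones.
probs = np.array([0.50, 0.25, 0.15, 0.04, 0.03, 0.02, 0.01])
print(top_k_filter(probs, k=3))
print(top_p_filter(probs, p=0.8))
```

With these example numbers both filters keep the same three tokens. On a flatter distribution top-p would keep more candidates and on a sharper one fewer, whereas top-k always keeps exactly k.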
Step 5: Autoregressive Generation Loop
LLM text generation is **autoregressive** — each generated token becomes part of the input for generating the next token. The model feeds its own output back as input, one token at a time.
**The generation loop:**
```
1. Start with prompt tokens: ["The", "cat", "sat", "on", "the"]
2. Forward pass → predict distribution → select "mat"
3. Append: ["The", "cat", "sat", "on", "the", "mat"]
4. Forward pass → predict distribution → select "."
5. Append: ["The", "cat", "sat", "on", "the", "mat", "."]
6. Forward pass → predict distribution → select "<EOS>"
7. Stop — EOS token reached.
```
**Why is this sequential?** Unlike training (where all positions are computed in parallel with causal masking), generation must be sequential because each new token depends on all previous ones. You can't predict token 7 until tokens 1-6 exist.
**KV-cache optimization:** Because previous tokens don't change, their key and value vectors in each attention layer can be **cached** and reused. Without KV-caching, every step would re-run the model over the entire sequence, recomputing keys and values for all n tokens. With it, only the new token passes through the layers. Its query still attends over all n cached keys, so attention costs O(n) per generated token instead of O(n²), and the rest of the per-token work no longer grows with sequence length.
This is why LLM inference has two distinct phases:
- **Prefill:** Process the entire prompt in parallel (fast, like training)
- **Decode:** Generate tokens one at a time (slow, sequential)
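```python
# Autoregressive generation loop (simplified sketch).
# `model.forward`, `softmax`, `sample`, and `EOS_TOKEN` are placeholders
# for whatever your framework provides.
def generate(model, prompt_tokens, max_length=100, temperature=1.0):
    tokens = list(prompt_tokens)
    kv_cache = None
    # Prefill: the first call processes the whole prompt and fills the cache.
    next_input = tokens
    for _ in range(max_length):  # max_length caps the number of NEW tokens
        # logits: scores for the token that follows the last input position
        logits, kv_cache = model.forward(next_input, kv_cache=kv_cache)
        probs = softmax(logits / temperature)
        next_token = sample(probs)  # Or argmax for greedy
        if next_token == EOS_TOKEN:
            break
        tokens.append(next_token)
        # Decode: later calls pass only the newest token and reuse cached K,V.
        next_input = [next_token]
    return tokens
```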
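The caching idea can be checked on a toy single-head attention layer. Everything here (the head dimension, the random projection matrices, the `KVCache` class) is an assumption for illustration; a real model keeps one such cache per layer and per head. The sketch shows that attending from the newest token over cached keys and values gives the same output as recomputing keys and values for the whole sequence.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8                                         # toy head dimension (assumed)
w_q, w_k, w_v = (rng.normal(size=(D, D)) for _ in range(3))

def attend(q, keys, values):
    """Attention output for one query over all available keys and values."""
    scores = keys @ q / np.sqrt(D)            # one score per position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values

def last_output_no_cache(x):
    """Recompute K and V for every position, then attend from the last one."""
    keys, values = x @ w_k, x @ w_v
    return attend(x[-1] @ w_q, keys, values)

class KVCache:
    """Stores K and V rows so each step only projects the newest token."""
    def __init__(self):
        self.keys = np.empty((0, D))
        self.values = np.empty((0, D))

    def step(self, x_new):
        self.keys = np.vstack([self.keys, x_new @ w_k])
        self.values = np.vstack([self.values, x_new @ w_v])
        return attend(x_new @ w_q, self.keys, self.values)

x = rng.normal(size=(6, D))                   # 6 token vectors standing in for a sequence
cache = KVCache()
for t in range(len(x)):
    cached_out = cache.step(x[t])
    full_out = last_output_no_cache(x[: t + 1])
    assert np.allclose(cached_out, full_out)  # same result, less recomputation
```

No causal mask is needed in this sketch because the query always comes from the last position, which is allowed to see every earlier token.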
Step 6: Stopping Conditions
An autoregressive model will keep generating tokens forever unless we tell it to stop. There are several stopping conditions:
**1. End-of-Sequence (EOS) token:** The model generates a special token (e.g., `<|endoftext|>`, `</s>`, `<EOS>`) that signals it considers the response complete. This is the most natural stopping point — the model has learned during training when text should end.
**2. Maximum length:** A hard cap on the number of tokens generated (e.g., 4,096 tokens). Prevents runaway generation. If the model hasn't produced an EOS token by this limit, generation is forcibly stopped (possibly mid-sentence).
**3. Stop sequences:** Custom strings that trigger early stopping. For example, in a chat application, you might stop generation when the model outputs "User:" (indicating it's hallucinating the next user message). The API caller specifies these.
**4. Custom logic:** Applications may implement their own stopping rules — stop after N sentences, stop when a certain pattern is detected, stop when confidence drops below a threshold, etc.
**In practice, multiple conditions are combined:**
- Generate up to max_length tokens
- Stop early if EOS is generated
- Stop early if any stop sequence is produced
- The first condition triggered wins
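A combined check might look like the sketch below. `next_token_fn` is a hypothetical callable standing in for a model (it takes the text so far and returns the next token as a string), and `scripted` is a test double that replays a fixed token list. Trimming the stop sequence from the returned text is a design choice assumed here, not something the lesson prescribes.

```python
EOS = "<EOS>"

def generate_with_stops(next_token_fn, prompt, max_new_tokens, stop_sequences=()):
    """Run a generation loop and report which stopping condition fired."""
    generated = ""
    for _ in range(max_new_tokens):
        token = next_token_fn(prompt + generated)
        if token == EOS:
            return generated, "eos"
        generated += token
        for stop in stop_sequences:
            position = generated.find(stop)
            if position != -1:
                # Trim the stop sequence itself so it is not shown to the user.
                return generated[:position], "stop_sequence"
    return generated, "max_length"

def scripted(tokens):
    """Stand-in for a model: replays a fixed token list, then emits EOS."""
    iterator = iter(tokens)
    return lambda _text: next(iterator, EOS)

text, reason = generate_with_stops(
    scripted(["Sure", ",", " here", " it", " is", ".", "\nUser:", " thanks"]),
    prompt="Assistant:",
    max_new_tokens=20,
    stop_sequences=["\nUser:"],
)
print(repr(text), reason)
```

Returning the reason alongside the text lets the caller distinguish a natural ending from a forced cut at the length limit, which may be truncated mid-sentence.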
Getting stopping conditions right is crucial for production systems. A model that doesn't stop wastes compute and money. A model that stops too early produces truncated, unhelpful responses.
Step 7: Test Your Understanding
You've learned how LLMs generate text — from computing probability distributions to selecting tokens to knowing when to stop. Let's check your understanding!
Prerequisites
- What Is an LLM
- Tokenization
- Basic understanding of probability distributions
Key Concepts
- Next-Token Prediction
- Logits and Softmax
- Greedy Decoding
- Vocabulary Projection
- Autoregressive Generation
- KV-Cache Optimization
- Stopping Conditions (EOS, Max Length)