Text Generation
Next Token Prediction
Difficulty: Intermediate · Duration: 12-15 min · Prerequisites: Pre-training · Step: 1 of 7
Text generation in LLMs is built on a single operation: predict the next token.
Given a sequence of tokens (the "context" or "prompt"), the model produces a probability distribution over the entire vocabulary — a number for every possible next token indicating how likely it is.
The pipeline for each prediction (a minimal code sketch follows the list):
- Tokenize the input text into token IDs: "The cat sat on the" → [464, 3797, 3332, 319, 262]
- Embed each token ID into a vector (lookup in the embedding table)
- Process through all transformer layers (self-attention + feed-forward, repeated N times)
- Project the final hidden state to vocabulary size: a vector of ~50,000 numbers (logits)
- Apply softmax to convert logits to probabilities that sum to 1.0
- Select the next token from this distribution
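Here is a minimal sketch of that pipeline using the Hugging Face transformers library with GPT-2. The model name and the greedy selection at the end are illustrative choices; any causal LM follows the same steps.

```python
# Minimal sketch of one next-token prediction step (assumes the Hugging Face
# `transformers` library and the GPT-2 checkpoint; both are illustrative).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# 1. Tokenize: text -> token IDs (with GPT-2's tokenizer these should match
#    the [464, 3797, 3332, 319, 262] shown above).
input_ids = tokenizer("The cat sat on the", return_tensors="pt").input_ids

with torch.no_grad():
    # 2-4. Embed, run the transformer layers, project to vocabulary size.
    #      The model returns logits for every position: shape [1, 5, vocab_size].
    logits = model(input_ids).logits

# Only the last position's logits are used to predict the next token.
next_token_logits = logits[0, -1, :]              # shape [vocab_size]

# 5. Softmax: convert logits to probabilities that sum to 1.0.
probs = torch.softmax(next_token_logits, dim=-1)

# 6. Select: greedy here (pick the single most likely token).
next_token_id = torch.argmax(probs).item()
print(tokenizer.decode([next_token_id]))          # prints the chosen token as text
```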
This final selection step is where decoding strategies come in — greedy, sampling, beam search, etc. The choice of strategy dramatically affects the quality and diversity of generated text.
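As a rough illustration of how those strategies differ, here are sketches of greedy selection, temperature sampling, and top-k sampling applied to the same logit vector. The temperature and k values are arbitrary examples.

```python
import torch

def greedy(logits: torch.Tensor) -> int:
    # Deterministic: always pick the single most likely token.
    return int(torch.argmax(logits))

def sample_with_temperature(logits: torch.Tensor, temperature: float = 0.8) -> int:
    # Rescale logits before softmax: T < 1 sharpens the distribution,
    # T > 1 flattens it, and T -> 0 approaches greedy selection.
    probs = torch.softmax(logits / temperature, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))

def top_k_sample(logits: torch.Tensor, k: int = 50) -> int:
    # Keep only the k most likely tokens, renormalize, then sample.
    top_logits, top_ids = torch.topk(logits, k)
    probs = torch.softmax(top_logits, dim=-1)
    return int(top_ids[torch.multinomial(probs, num_samples=1)])
```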
The model doesn't "think" about what to say — it computes a mathematical function that maps input tokens to a probability distribution over next tokens. Yet this simple process produces remarkably coherent text.
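In symbols, that function ends with the softmax from step 5, which turns the logit vector z (one entry per vocabulary token) into a probability distribution:

$$
P(\text{token}_i \mid \text{context}) = \frac{e^{z_i}}{\sum_{j=1}^{V} e^{z_j}}, \qquad V \approx 50{,}000
$$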
Prompt: "The cat sat on the ___" → Predict Next Token
Positions 0-4 hold the context tokens "The", "cat", "sat", "on", "the"; position 5 is the unknown next token to be predicted.
The Next-Token Prediction Pipeline
| Pipeline Stage | Input | Output | Shape |
|---|---|---|---|
| 1. Tokenize | "The cat sat on the" | [464, 3797, 3332, 319, 262] | 5 integers |
| 2. Embed | Token IDs | Embedding vectors | 5 x 4096 |
| 3. Transform | Embeddings | Contextualized representations | 5 x 4096 |
| 4. Project | Last position hidden state | Raw logits over vocabulary | 1 x 50,000 |
| 5. Softmax | Logits | Probability distribution | 1 x 50,000 (sums to 1.0) |
| 6. Select | Probabilities | Chosen token: "mat" | 1 integer |
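The shapes in the table assume a model with a 4096-dimensional hidden state and a ~50,000-token vocabulary. The sketch below prints the analogous shapes for GPT-2 (hidden size 768, vocabulary size 50,257), just to show where each stage's output lives.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("The cat sat on the", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(input_ids, output_hidden_states=True)

print(input_ids.shape)                 # torch.Size([1, 5])       - 5 token IDs
print(out.hidden_states[0].shape)      # torch.Size([1, 5, 768])  - embedding vectors
print(out.hidden_states[-1].shape)     # torch.Size([1, 5, 768])  - contextualized representations
print(out.logits[:, -1, :].shape)      # torch.Size([1, 50257])   - logits for the last position
```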