Text generation in LLMs is built on a single operation: predict the next token.

Given a sequence of tokens (the "context" or "prompt"), the model produces a probability distribution over the entire vocabulary: one probability for every possible next token, indicating how likely that token is to come next.

The pipeline for each prediction:

• Tokenize the input text into token IDs: "The cat sat on the" → [464, 3797, 3332, 319, 262]
• Embed each token ID into a vector (a lookup in the embedding table)
• Process the vectors through all transformer layers (self-attention + feed-forward, repeated N times)
• Project the hidden state at the last position to vocabulary size: a vector of ~50,000 numbers (logits)
• Apply softmax to convert the logits into probabilities that sum to 1.0
• Select the next token from this distribution
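The softmax step in the list above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code of any particular model: the four logit values are made up, and a real vocabulary would have roughly 50,000 entries rather than four.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Convert raw logits into probabilities that sum to 1.0."""
    # Subtracting the maximum leaves the result unchanged
    # but keeps exp() from overflowing on large logits.
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

# Hypothetical logits for a toy 4-token vocabulary (illustrative values only).
logits = np.array([2.0, 1.0, 0.1, -1.0])
probs = softmax(logits)

print(probs)        # one probability per vocabulary entry
print(probs.sum())  # approximately 1.0
```

Larger logits map to larger probabilities, and the ordering of the tokens is preserved, so the highest logit always becomes the most likely token.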

This final selection step is where decoding strategies such as greedy decoding, sampling, and beam search come in. The choice of strategy strongly affects both the quality and the diversity of the generated text.
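To make the difference between two of these strategies concrete, here is a small sketch that assumes a made-up distribution over a 4-token vocabulary. The function names `greedy_select` and `sample_select` are hypothetical helpers written for this illustration; beam search is omitted because it tracks several candidate sequences rather than picking a single token.

```python
import numpy as np

def greedy_select(probs: np.ndarray) -> int:
    """Greedy decoding: always pick the single most likely token."""
    return int(np.argmax(probs))

def sample_select(probs: np.ndarray, rng: np.random.Generator) -> int:
    """Sampling: draw a token ID at random, weighted by its probability."""
    return int(rng.choice(len(probs), p=probs))

# Hypothetical next-token probabilities (illustrative values only).
probs = np.array([0.6, 0.25, 0.1, 0.05])
rng = np.random.default_rng(seed=0)

greedy_id = greedy_select(probs)
sampled_ids = [sample_select(probs, rng) for _ in range(1000)]

print(greedy_id)                                # always the same token
print(np.bincount(sampled_ids, minlength=4))    # counts roughly follow probs
```

Greedy selection is deterministic, so the same prompt always yields the same continuation. Sampling returns less likely tokens some of the time, which is where the added diversity comes from.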

The model doesn't "think" about what to say. It computes a mathematical function that maps input tokens to a probability distribution over next tokens. Yet this simple process produces remarkably coherent text.
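Treating the model as a plain function from token IDs to probabilities, longer text comes from calling that function repeatedly and appending each chosen token to the context. The sketch below assumes a hypothetical `toy_model` that stands in for a real LLM (it simply favors the ID that follows the last token) and uses greedy selection for simplicity.

```python
from typing import Callable, List

import numpy as np

VOCAB_SIZE = 8  # toy vocabulary size, chosen for illustration

def toy_model(token_ids: List[int]) -> np.ndarray:
    """Hypothetical stand-in for an LLM: favors (last ID + 1) mod VOCAB_SIZE."""
    logits = np.zeros(VOCAB_SIZE)
    logits[(token_ids[-1] + 1) % VOCAB_SIZE] = 3.0
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def generate(model: Callable[[List[int]], np.ndarray],
             prompt_ids: List[int],
             max_new_tokens: int) -> List[int]:
    """Repeat next-token prediction, feeding each chosen token back in."""
    tokens = list(prompt_ids)
    for _ in range(max_new_tokens):
        probs = model(tokens)                 # distribution over next tokens
        tokens.append(int(np.argmax(probs)))  # greedy selection
    return tokens

output = generate(toy_model, [2, 5], max_new_tokens=4)
print(output)  # [2, 5, 6, 7, 0, 1]
```

Nothing in the loop plans ahead: each step sees only the tokens produced so far and returns one distribution.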

Pipeline Stage	Input	Output	Shape
1. Tokenize	"The cat sat on the"	[464, 3797, 3332, 319, 262]	5 integers
2. Embed	Token IDs	Embedding vectors	5 x 4096
3. Transform	Embeddings	Contextualized representations	5 x 4096
4. Project	Last position hidden state	Raw logits over vocabulary	1 x 50,000
5. Softmax	Logits	Probability distribution	1 x 50,000 (sums to 1.0)
6. Select	Probabilities	Chosen token: "mat"	1 integer
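The shapes in the table can be traced with random weights. This is only a shape walkthrough under loud assumptions: the sizes are shrunk (vocabulary 100 and hidden size 16 instead of ~50,000 and 4096), the token IDs are made up to fit the toy vocabulary, and a single `tanh` layer is a placeholder for the transformer stack, not an implementation of attention.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE, D_MODEL = 100, 16  # toy sizes; the table uses ~50,000 and 4096

# 1. Tokenize: hypothetical IDs that fit the toy vocabulary.
token_ids = np.array([4, 37, 33, 19, 62])

# Random weights standing in for trained parameters.
embedding_table = rng.normal(size=(VOCAB_SIZE, D_MODEL))
layer_weights = rng.normal(size=(D_MODEL, D_MODEL)) / np.sqrt(D_MODEL)
output_proj = rng.normal(size=(D_MODEL, VOCAB_SIZE))

# 2. Embed: one row of the table per token -> (5, 16)
embedded = embedding_table[token_ids]

# 3. Transform: placeholder layer, shape is preserved -> (5, 16)
hidden = np.tanh(embedded @ layer_weights)

# 4. Project: only the last position is needed -> (100,)
logits = hidden[-1] @ output_proj

# 5. Softmax: probabilities over the vocabulary -> (100,)
exp = np.exp(logits - logits.max())
probs = exp / exp.sum()

# 6. Select: a single token ID (greedy choice here)
next_id = int(np.argmax(probs))

print(embedded.shape, hidden.shape, logits.shape, probs.shape)
```

Only the last row of `hidden` feeds the projection, which is why the final stages have one row of vocabulary-sized output rather than five.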

Figure: The Next-Token Prediction Pipeline. Prompt: "The cat sat on the ___" → Predict Next Token