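Greedy Decoding: Always Picks the Top Token

| Token | Probability | Greedy picks? |
|---|---|---|
| the | 0.4245 | Yes (always) |
| cat | 0.2575 | Never |
| sat | 0.1279 | Never |
|  | 0.0776 | Never |
| mat | 0.0470 | Never |
| dog | 0.0348 | Never |
| ran | 0.0211 | Never |
| big | 0.0095 | Never |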

Problems with Greedy Decoding

| Problem | Example | Impact |
|---|---|---|
| Repetition | "The cat sat. The cat sat. The cat sat." | Outputs loop on high-probability patterns |
| No creativity | Same output every run, no variation | Useless for creative writing, brainstorming |
| Local optima | Picking "United" after "president of the" every time | Misses globally better sequences |
| Degeneration | Long outputs devolve into repeated phrases | Quality degrades with length |

Sampling Strategies: Temperature, Top-k, Top-p

Learn how temperature, top-k, and top-p control the randomness and creativity of LLM text generation.

When a language model predicts the next token, it outputs a probability distribution over its vocabulary. How you sample from that distribution dramatically affects output quality. Greedy decoding (always picking the top token) produces boring, repetitive text. Sampling strategies like temperature scaling, top-k filtering, and nucleus (top-p) sampling introduce controlled randomness for more natural, diverse, and creative outputs. You'll see pre-computed probability distributions at different temperatures and learn which settings work best for different tasks.

Learning Objectives

Explain why greedy decoding produces poor-quality text
Describe how temperature reshapes a probability distribution
Compare top-k and top-p filtering approaches
Choose appropriate sampling settings for different tasks

Step 1: Why Not Always Pick the Most Likely Token?

When a language model predicts the next token, it outputs a probability distribution over the entire vocabulary. The simplest approach is **greedy decoding**: always pick the token with the highest probability.

But greedy decoding has serious problems:

**1. Boring, repetitive text.** Greedy decoding tends to produce generic, safe outputs. "The weather is nice. The weather is nice. The weather is nice..." The most probable next token is often the most predictable one.

**2. No diversity.** Ask the model to write a story ten times with greedy decoding and you get the exact same story every time. There's no randomness.

**3. Suboptimal sequences.** The locally most-probable token at each step doesn't always lead to the globally best sequence. "The president of the" → greedy picks "United" every time, missing creative alternatives.

**Sampling** introduces controlled randomness: instead of always picking the top token, we sample from the distribution, occasionally choosing less likely but still reasonable tokens. The key question is **how much randomness** to introduce.
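To make the contrast concrete, here is a minimal sketch that compares greedy decoding with plain sampling on the eight probabilities from the example table. The helper names `greedy_pick` and `sample_pick` are made up for this illustration, and `"<?>"` is a placeholder for the fourth token, whose label is blank in the table.

```python
import random

# Probabilities from the lesson's example table. The fourth token's label
# is blank in the table, so "<?>" is only a placeholder.
TOKENS = ["the", "cat", "sat", "<?>", "mat", "dog", "ran", "big"]
PROBS = [0.4245, 0.2575, 0.1279, 0.0776, 0.0470, 0.0348, 0.0211, 0.0095]


def greedy_pick(tokens, probs):
    """Return the single most probable token (fully deterministic)."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return tokens[best]


def sample_pick(tokens, probs, rng):
    """Draw one token at random, weighted by its probability."""
    return rng.choices(tokens, weights=probs, k=1)[0]


rng = random.Random(0)  # fixed seed only so the demo is repeatable
greedy_runs = {greedy_pick(TOKENS, PROBS) for _ in range(10)}
sampled_runs = [sample_pick(TOKENS, PROBS, rng) for _ in range(1000)]

print(greedy_runs)             # {'the'}: ten runs, one distinct result
print(len(set(sampled_runs)))  # sampling reaches several distinct tokens
```

Greedy decoding returns "the" on every run, while sampling still favors "the" but also reaches the less likely tokens.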

Step 2: Temperature Scaling

**Temperature** is the simplest way to control randomness in sampling. It scales the logits (raw model outputs) before applying softmax. The formula is:

**P(token_i) = softmax(logit_i / T)**

Where T is the temperature:

- **T = 0.1 (sharp):** Dividing by 0.1 is the same as multiplying by 10, so differences between logits are amplified dramatically. The distribution becomes extremely peaked. The top token dominates with near-certainty.
- **T = 1.0 (normal):** No scaling. The model's natural distribution is preserved. This is the baseline.
- **T = 2.0 (flat):** Dividing by 2 compresses differences between logits. The distribution becomes flatter, giving low-probability tokens a bigger share. More creative but less coherent.
- **T → 0:** The distribution approaches a one-hot vector (greedy decoding).
- **T → ∞:** The distribution approaches uniform random (every token equally likely).

Temperature doesn't change which token is most probable; it changes **how much more probable** the top token is compared to the rest.

P(token_i) = exp(logit_i / T) / Σ_j exp(logit_j / T)
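The formula can be sketched in a few lines of Python. This is an illustration rather than any particular library's implementation: `softmax_with_temperature` is a hypothetical helper, and because the lesson gives probabilities instead of raw logits, the log of each table probability stands in for a logit (softmax of log-probabilities at T = 1 returns the original distribution).

```python
import math


def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / temperature). Requires temperature > 0."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for T = 0")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Probabilities from the lesson's example table.
PROBS = [0.4245, 0.2575, 0.1279, 0.0776, 0.0470, 0.0348, 0.0211, 0.0095]
# Assumption: log-probabilities are used as stand-in logits.
LOGITS = [math.log(p) for p in PROBS]

for t in (0.1, 1.0, 2.0):
    dist = softmax_with_temperature(LOGITS, t)
    print(f"T={t}: top token probability = {dist[0]:.4f}")
```

At T = 0.1 the top token takes almost all of the probability mass, at T = 1.0 the table values come back unchanged, and at T = 2.0 the top token's share shrinks while the ranking stays the same.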

Step 3: Top-k Sampling

**Top-k sampling** takes a different approach: instead of reshaping the entire distribution, it **truncates** it. Only the top k most probable tokens are kept; everything else is set to zero.

The algorithm:

1. Sort tokens by probability (highest first)
2. Keep only the top k tokens
3. Set all other probabilities to 0
4. Re-normalize the remaining probabilities to sum to 1
5. Sample from this filtered distribution

**Example with k=3:** Only "the", "cat", and "sat" survive. All other tokens are eliminated. We then re-normalize so these three probabilities sum to 1.0: their original probabilities (0.4245, 0.2575, 0.1279) add up to 0.8099, so they become roughly 0.52, 0.32, and 0.16.

**Benefits:** Prevents sampling from the "long tail" of low-probability tokens. Tokens ranked below k can never be chosen, which removes most nonsensical options.

**Problem:** k is fixed regardless of the distribution shape. If the model is very confident (one token has 95% probability), k=50 still allows 49 unlikely tokens. If the model is uncertain (many tokens around 5%), k=3 eliminates perfectly reasonable options. This motivates top-p sampling.
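The five steps can be sketched as follows. `top_k_filter` is a hypothetical helper written for this lesson, not a library function, and it works on a plain list of probabilities taken from the example table.

```python
import random


def top_k_filter(probs, k):
    """Keep the k largest probabilities, zero the rest, and re-normalize."""
    if not 1 <= k <= len(probs):
        raise ValueError("k must be between 1 and len(probs)")
    # Steps 1-2: rank token indices by probability and keep the first k.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = set(ranked[:k])
    # Step 3: zero out everything that did not make the cut.
    filtered = [p if i in kept else 0.0 for i, p in enumerate(probs)]
    # Step 4: re-normalize so the survivors sum to 1.
    total = sum(filtered)
    return [p / total for p in filtered]


# Probabilities from the lesson's example table; "<?>" is a placeholder
# for the fourth token, whose label is blank in the table.
TOKENS = ["the", "cat", "sat", "<?>", "mat", "dog", "ran", "big"]
PROBS = [0.4245, 0.2575, 0.1279, 0.0776, 0.0470, 0.0348, 0.0211, 0.0095]

top3 = top_k_filter(PROBS, 3)
print([round(p, 3) for p in top3])
# [0.524, 0.318, 0.158, 0.0, 0.0, 0.0, 0.0, 0.0]

# Step 5: sample from the filtered distribution.
rng = random.Random(0)  # fixed seed only so the demo is repeatable
print(rng.choices(TOKENS, weights=top3, k=5))
```

Zero-weight tokens can never be drawn, so every sample comes from "the", "cat", or "sat".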

Step 4: Top-p (Nucleus) Sampling

**Top-p sampling** (also called **nucleus sampling**) adapts the number of tokens dynamically based on the distribution shape. Instead of keeping a fixed number of tokens, it keeps the smallest set of top-ranked tokens whose cumulative probability reaches or exceeds a threshold p.

The algorithm:

1. Sort tokens by probability (highest first)
2. Compute cumulative probabilities: P₁, P₁+P₂, P₁+P₂+P₃, ...
3. Find the smallest set where the cumulative sum ≥ p
4. Zero out everything else and re-normalize
5. Sample from this "nucleus"

**Why this is better than top-k:**

- When the model is **confident** (top token at 90%), only 1-2 tokens survive → focused output
- When the model is **uncertain** (many tokens around 5%), many tokens survive → diverse output
- The filtering automatically adapts to the distribution shape

**Example with p=0.9:** We add tokens from most to least probable until we reach 90% cumulative probability. This might include 3 tokens when the distribution is peaked, or 10 tokens when it's flat. For the eight-token example distribution used in this lesson, the cumulative sums run 0.4245, 0.6820, 0.8099, 0.8875, 0.9345, so the nucleus contains five tokens.

This adaptive behavior is why top-p is generally preferred over top-k in modern systems.
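A minimal sketch of the nucleus filter, under the same assumptions as the rest of this lesson: `top_p_filter` is a hypothetical helper, and the inputs are plain probability lists. The second distribution, `PEAKED`, is invented here only to show the adaptive behavior.

```python
def top_p_filter(probs, p):
    """Keep the smallest top-ranked set with cumulative probability >= p."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    # Step 1: rank token indices by probability, highest first.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # Steps 2-3: walk down the ranking until the running sum reaches p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    # Step 4: zero out the rest and re-normalize the nucleus.
    kept_set = set(kept)
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept_set else 0.0
            for i in range(len(probs))]


def nucleus_size(probs, p):
    """Count how many tokens survive the top-p filter."""
    return sum(1 for q in top_p_filter(probs, p) if q > 0)


# Probabilities from the lesson's example table.
PROBS = [0.4245, 0.2575, 0.1279, 0.0776, 0.0470, 0.0348, 0.0211, 0.0095]
# Made-up confident distribution, used only for contrast.
PEAKED = [0.95, 0.03, 0.01, 0.01]

print(nucleus_size(PROBS, 0.9))   # 5
print(nucleus_size(PEAKED, 0.9))  # 1
```

The same threshold keeps five tokens for the lesson's distribution and a single token for the confident one, which is the adaptivity that a fixed k lacks. Step 5, the actual draw, works exactly as with top-k: sample from the returned list using the values as weights.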

Step 5: Comparing Strategies Side by Side

Let's compare all the strategies we've seen on the same set of logits. This makes the differences concrete.

- **Temperature** reshapes the entire distribution: it makes peaks sharper or flatter, but every token keeps a non-zero probability.
- **Top-k** hard-cuts at a fixed rank: exactly k tokens survive, regardless of how confident the model is.
- **Top-p** hard-cuts at a cumulative probability threshold: the number of surviving tokens adapts to the distribution.

**In practice, these are often combined:**

- Temperature + Top-p: First reshape with temperature, then filter with top-p
- Temperature + Top-k: First reshape with temperature, then keep only top-k
- All three: Temperature → Top-k → Top-p (this is the order many APIs use)

The order matters: temperature first (it changes the shape), then top-k/top-p (they truncate). Temperature preserves the ranking of tokens, so it never changes which tokens top-k keeps, but it does change the cumulative sums that top-p works with, and therefore the size of the nucleus.

**Key insight:** There is no single "best" strategy. The right choice depends on your task. The next step covers practical guidelines.
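The combined pipeline can be sketched as a single function. This is one plausible arrangement, not the code of any specific API: `sample_next` and its parameter names are assumptions for illustration, T = 0 is treated as greedy decoding, and top-p is applied to the distribution re-normalized after top-k. As before, log-probabilities from the example table stand in for logits.

```python
import math
import random


def sample_next(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Sample a token index: temperature -> top-k -> top-p -> draw."""
    if top_k is not None and top_k < 1:
        raise ValueError("top_k must be at least 1")
    if temperature <= 0:
        # Treat T = 0 as greedy decoding: no randomness at all.
        return max(range(len(logits)), key=lambda i: logits[i])

    # 1. Temperature: rescale the logits, then apply softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank token indices once; both filters cut this ranking short.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # 2. Top-k: keep at most k top-ranked tokens.
    if top_k is not None:
        order = order[:top_k]

    # 3. Top-p: keep the shortest prefix whose re-normalized cumulative
    #    probability reaches top_p.
    if top_p is not None:
        mass = sum(probs[i] for i in order)
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i] / mass
            if cumulative >= top_p:
                break
        order = kept

    # 4. Draw from the survivors; choices() normalizes the weights itself.
    return rng.choices(order, weights=[probs[i] for i in order], k=1)[0]


# Probabilities from the lesson's example table, as stand-in logits.
PROBS = [0.4245, 0.2575, 0.1279, 0.0776, 0.0470, 0.0348, 0.0211, 0.0095]
LOGITS = [math.log(p) for p in PROBS]

rng = random.Random(0)  # fixed seed only so the demo is repeatable
print([sample_next(LOGITS, temperature=0.8, top_k=5, top_p=0.9, rng=rng)
       for _ in range(10)])
```

Passing `None` for `top_k` or `top_p` skips that filter, so the same function covers the "Temperature + Top-p" and "Temperature + Top-k" combinations as well.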

Step 6: Practical Settings for Different Tasks

Choosing sampling parameters is an art informed by the task. Here are commonly used starting points:

- **Code generation:** Low temperature (0.0-0.2), top-p 0.9. Code has strict syntax, so you want the model to be precise. A misplaced token causes a compilation error. Many coding assistants use T=0 (greedy) for single-line completions.
- **Factual Q&A:** Low temperature (0.0-0.3), no top-k/top-p needed. You want the single most likely answer. Higher temperature increases the risk of hallucination.
- **Creative writing:** Higher temperature (0.7-1.0), top-p 0.9-0.95. You want variety and surprise, but not nonsense. Top-p prevents the very unlikely tokens that would break coherence.
- **Brainstorming:** High temperature (0.8-1.2), top-p 0.95. Maximize diversity of ideas. Some nonsense is acceptable because you're filtering ideas manually.
- **Translation:** Low temperature (0.1-0.3). Translations have "correct" answers, so you want accuracy over creativity.

**The golden rule:** Start with T=1.0, top-p=0.9, then adjust. Lower temperature for precision tasks, raise it for creative tasks. Top-p=0.9 is almost always a good default.
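If you want these guidelines available in code, one option is a small lookup table. The sketch below only transcribes the ranges listed above; the names `SamplingRange`, `PRESETS`, and `within_guideline`, and the task keys, are made up for this example and do not correspond to any library.

```python
from typing import NamedTuple, Optional, Tuple


class SamplingRange(NamedTuple):
    temperature: Tuple[float, float]      # (low, high) from the guidelines
    top_p: Optional[Tuple[float, float]]  # None where no top-p is given


# Ranges transcribed from the guidelines above; the keys are hypothetical.
PRESETS = {
    "code_generation": SamplingRange((0.0, 0.2), (0.9, 0.9)),
    "factual_qa": SamplingRange((0.0, 0.3), None),
    "creative_writing": SamplingRange((0.7, 1.0), (0.9, 0.95)),
    "brainstorming": SamplingRange((0.8, 1.2), (0.95, 0.95)),
    "translation": SamplingRange((0.1, 0.3), None),
}


def within_guideline(task, temperature, top_p=None):
    """Return True if the settings fall inside the range listed for the task."""
    preset = PRESETS[task]
    low, high = preset.temperature
    if not low <= temperature <= high:
        return False
    if preset.top_p is not None and top_p is not None:
        p_low, p_high = preset.top_p
        return p_low <= top_p <= p_high
    return True  # no top-p guidance to check against


print(within_guideline("code_generation", 0.1, top_p=0.9))  # True
print(within_guideline("translation", 0.9))                 # False
```

A check like this is only a sanity guard; the ranges are starting points to tune from, not hard limits.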

Step 7: Test Your Understanding

You've learned how temperature, top-k, and top-p control the randomness of text generation. Let's test your understanding!

Prerequisites

Understanding of probability distributions
Basic knowledge of how LLMs generate text token by token

Key Concepts

Greedy Decoding
Temperature Scaling
Top-k Sampling
Nucleus (Top-p) Sampling
Softmax Distribution
Sampling Parameter Tuning
