Tokenization

Text to Numbers

Difficulty: Beginner
Duration: 10-12 min
Prerequisites: What Is an LLM

Text Must Be Converted to Numbers

[Figure: the characters H, e, l, l, o at positions 0-4 map to the character codes [72, 101, 108, 108, 111].]

Three Tokenization Approaches

| Approach | Input: "Hello world" | Tokens | Vocabulary Size |
|---|---|---|---|
| Character-level | H, e, l, l, o, ·, w, o, r, l, d | 11 tokens | ~256 (one per byte value) |
| Word-level | Hello, world | 2 tokens | 170,000+ (English) |
| Subword (BPE) | Hell, o, ·world | 3 tokens | ~32,000-100,000 |

("·" marks a space.)

Tokenization — Lesson Content

Learn how text is split into tokens and converted to numbers — the essential first step before any LLM can process language.

Before an LLM can process "Hello world," that text must be converted into numbers. Tokenization is the bridge between human-readable text and the numerical world of neural networks. This lesson covers the three main approaches (character, word, and subword tokenization), dives into the BPE algorithm that powers GPT and most modern LLMs, walks through a concrete tokenization example, explores vocabulary size tradeoffs, and explains the special tokens ([CLS], [SEP], [PAD], [MASK]) that give models structural information.

Learning Objectives

  • Explain why neural networks require text to be converted to numbers
  • Compare character, word, and subword tokenization strategies
  • Describe the BPE algorithm at an intuitive level
  • Trace the tokenization pipeline from raw text to embedding vectors
  • Identify common special tokens and their purposes

Step 1: Why Text Needs Numbers

Neural networks are mathematical machines: they perform matrix multiplications, additions, and nonlinear activations. They operate on **numbers**, not text. Before an LLM can process "Hello world," it must convert that text into a sequence of numbers.

This conversion process is called **tokenization**: splitting text into discrete units (tokens) and mapping each to an integer ID.

**Why not just use ASCII/Unicode codes?** You could represent "H" as 72, "e" as 101, etc. But character-level processing has problems:

- Sequences become very long (a 500-word essay = ~2,500 characters)
- Individual characters carry little meaning ("q" alone tells you almost nothing)
- The model must learn to compose characters into words from scratch

**Why not just use whole words?** You could assign each word an ID: "Hello" = 1, "world" = 2. But:

- The vocabulary becomes enormous (English has 170,000+ words)
- Misspellings, new words, and rare terms get no representation
- Morphologically related words ("run," "running," "runner") share nothing

The solution is **subword tokenization**, a middle ground that we'll explore next.
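The two naive schemes described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how any production tokenizer works; the `word_vocab` dictionary, the `UNK_ID` fallback, and the `word_ids` helper are hypothetical names introduced here.

```python
# Character-level: every character maps to its Unicode code point.
text = "Hello"
char_ids = [ord(ch) for ch in text]
print(char_ids)  # [72, 101, 108, 108, 111]

# Word-level: every known word maps to an ID from a fixed vocabulary.
# `word_vocab` and `UNK_ID` are hypothetical, for illustration only.
word_vocab = {"Hello": 1, "world": 2}
UNK_ID = 0  # shared ID for any word missing from the vocabulary


def word_ids(sentence):
    """Split on whitespace and look each word up, falling back to UNK_ID."""
    return [word_vocab.get(word, UNK_ID) for word in sentence.split()]


print(word_ids("Hello world"))  # [1, 2]
print(word_ids("Hello wrold"))  # [1, 0]  (the misspelled word is lost)
```

The character version needs five IDs for one short word, while the word version loses the misspelled word entirely. Those are the two failure modes that subword tokenization tries to avoid.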

Step 2: Character vs Word vs Subword Tokenization

Let's compare the three main approaches in detail:

**Character-level tokenization** splits text into individual characters. The vocabulary is tiny (~256 entries when working with single bytes), so every possible input can be represented. But sequences are very long, and the model must learn spelling, morphology, and word boundaries from scratch. Used by some specialized models but rare in modern LLMs.

**Word-level tokenization** splits on whitespace and punctuation. Each word is one token. Clean and intuitive, but the vocabulary must be enormous to cover rare words. Unknown words (not in the vocabulary) are mapped to a special [UNK] token, which discards their content entirely. Used by older NLP pipelines, such as early Word2Vec-based ones.

**Subword tokenization** is the modern standard. It splits text into pieces that balance frequency and coverage:

- Common words stay whole: "the" → ["the"]
- Rare words are split: "tokenization" → ["token", "ization"]
- Very rare words decompose further: "defenestration" → ["def", "en", "est", "ration"]

These splits are illustrative; the actual pieces depend on the tokenizer's learned vocabulary.

The key insight: subword tokenization has a **finite vocabulary** (30K-100K tokens) but can still represent **any text**, because anything not covered by a larger piece falls back to smaller pieces, down to single characters or bytes. This is why GPT, Claude, and virtually all modern LLMs use it.
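To make the comparison concrete, here is a minimal sketch of all three strategies. The subword splitter uses a greedy longest-match rule, which is a simplification chosen for readability (closer to WordPiece-style inference than to BPE). The `WORDS` and `PIECES` vocabularies are hypothetical and hand-picked to mirror the examples above.

```python
def char_tokens(text):
    """Character-level: one token per character."""
    return list(text)


def word_tokens(text, vocab):
    """Word-level: whitespace split; out-of-vocabulary words become [UNK]."""
    return [w if w in vocab else "[UNK]" for w in text.split()]


def subword_tokens(word, pieces):
    """Greedy longest-match split; unmatched characters become single tokens."""
    tokens, start = [], 0
    while start < len(word):
        # Try the longest candidate first, shrinking until a known piece
        # is found or only a single character is left.
        for end in range(len(word), start, -1):
            if word[start:end] in pieces or end == start + 1:
                tokens.append(word[start:end])
                start = end
                break
    return tokens


# Hypothetical toy vocabularies (assumptions for this sketch).
WORDS = {"the", "cat"}
PIECES = {"the", "token", "ization", "def", "en", "est", "ration"}

print(char_tokens("the"))                         # ['t', 'h', 'e']
print(word_tokens("the defenestration", WORDS))   # ['the', '[UNK]']
print(subword_tokens("tokenization", PIECES))     # ['token', 'ization']
print(subword_tokens("defenestration", PIECES))   # ['def', 'en', 'est', 'ration']
```

The word-level tokenizer collapses "defenestration" into [UNK], while the subword splitter still produces usable pieces from a vocabulary of only seven entries plus single characters.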

Step 3: BPE Algorithm Intuition

**Byte Pair Encoding (BPE)** is the most common subword tokenization algorithm. Originally a data compression technique, it was adapted for NLP by Sennrich et al. (2016). Here's the intuition:

**Starting point:** Begin with a vocabulary of individual characters: {a, b, c, ..., z, A, ..., Z, 0-9, punctuation}.

**Iterative merging:** Repeatedly find the most frequent pair of adjacent tokens in the training corpus and merge them into a new token. After each merge, the pair counts are recomputed on the updated corpus.

**Step-by-step example on a tiny corpus:**

1. Start: vocabulary = {l, o, w, e, r, n, s, t, i, d, ...}
2. Most frequent pair: ("e", "s") → merge into "es"
3. Most frequent pair: ("es", "t") → merge into "est"
4. Most frequent pair: ("l", "o") → merge into "lo"
5. Most frequent pair: ("lo", "w") → merge into "low"
6. Most frequent pair: ("low", "est") → merge into "lowest"
7. Continue until the vocabulary reaches the target size (e.g., 50,000)

**Result:** Common words like "the," "is," "and" become single tokens. Rare words get split into frequent subword pieces. The vocabulary size is a hyperparameter, typically 32K to 100K tokens for modern LLMs. The merge rules learned during training are saved and applied, in the same order, at inference time.
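The merge loop can be sketched as follows. This is a minimal teaching version under stated assumptions: it works on characters rather than bytes, omits the end-of-word marker that real implementations usually add, and breaks ties in favor of the pair seen first. The function names and the toy word counts are hypothetical. On this corpus it reproduces the first four merges listed above; it contains no "lowest", so the fifth merge would differ.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(word_counts, num_merges):
    """Learn up to `num_merges` merge rules from a {word: count} corpus."""
    words = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        # max() returns the first pair with the highest count, so ties go
        # to the pair encountered earliest in the corpus.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        words = merge_pair(words, best)
    return merges


# Hypothetical toy corpus: word -> number of occurrences.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = train_bpe(corpus, num_merges=4)
print(merges)  # [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w')]
```

Each learned merge adds one token to the vocabulary, so the target vocabulary size directly sets how many times the loop runs.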

Step 4: Tokenization in Action

Let's trace through tokenizing "Hello world!" using a BPE tokenizer (similar to GPT-2's). The trace has four stages:

**Stage 1 (pre-processing):** The text is converted to bytes and split into word-like chunks. Spaces are typically attached to the following word as a prefix (shown as "·"). So "Hello world!" becomes ["Hello", "·world", "!"].

**Stage 2 (apply BPE merges):** The tokenizer applies its learned merge rules within each chunk. "Hello" is common enough to be a single token. "·world" is also a single token. "!" is a single token. Result: `["Hello", "·world", "!"]` → 3 tokens.

**Stage 3 (map to IDs):** Each token is looked up in the vocabulary to get its integer ID:

- "Hello" → 9906
- "·world" → 1917
- "!" → 0

These IDs are illustrative; the actual values depend on the tokenizer's vocabulary.

**Stage 4 (into the model):** The integer sequence [9906, 1917, 0] is passed to the embedding layer, which converts each ID into a dense vector (e.g., 768 or 4096 dimensions). These vectors are what the transformer actually processes.

Note that different tokenizers produce different splits! GPT-2, Claude, and Llama each have their own vocabulary and merge rules, so both the splits and the token IDs for the same text differ across models.
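The trace above can be reproduced end to end with a small sketch. Several things here are assumptions made for illustration: the regular expression is a simplified stand-in for a real pre-tokenization pattern, the three-entry `VOCAB` just reuses the IDs from the walkthrough, the merge stage is skipped because every chunk is assumed to already be a single token, and the embedding table is random with only 4 dimensions.

```python
import random
import re

# Hypothetical toy vocabulary using the IDs from the walkthrough above.
VOCAB = {"Hello": 9906, "·world": 1917, "!": 0}


def pre_tokenize(text):
    """Split into word-like chunks, keeping each space with the word after it."""
    chunks = re.findall(r" ?\w+| ?[^\w\s]+|\s+", text)
    return [chunk.replace(" ", "·") for chunk in chunks]


def encode(text):
    """Map each chunk to its ID; assumes every chunk is already one token."""
    return [VOCAB[chunk] for chunk in pre_tokenize(text)]


# A small random table standing in for a trained embedding matrix.
# Real models use hundreds or thousands of dimensions; 4 keeps this readable.
VOCAB_SIZE, DIM = 10_000, 4
rng = random.Random(0)
embedding_table = [
    [rng.uniform(-1.0, 1.0) for _ in range(DIM)] for _ in range(VOCAB_SIZE)
]


def embed(ids):
    """An embedding lookup is plain row indexing into the table."""
    return [embedding_table[token_id] for token_id in ids]


tokens = pre_tokenize("Hello world!")
ids = encode("Hello world!")
vectors = embed(ids)
print(tokens)                         # ['Hello', '·world', '!']
print(ids)                            # [9906, 1917, 0]
print(len(vectors), len(vectors[0]))  # 3 4
```

The embedding step is a lookup with no arithmetic: the token ID selects a row, and that row is the vector the transformer receives.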

Step 5: Vocabulary Size Tradeoffs

The vocabulary size is a critical design choice. It determines how text is split and how large the embedding table is. There's a fundamental tradeoff:

**Larger vocabulary (100K+ tokens):**

- Common words and phrases are single tokens → shorter sequences
- Fewer tokens per document → faster inference and lower cost
- But: the embedding table is huge (100K × 4096 ≈ 410M parameters just for embeddings)
- Rare tokens may not get enough training examples to learn good representations

**Smaller vocabulary (10K tokens):**

- Almost everything gets split into small pieces → very long sequences
- Embedding table is compact
- But: the model must learn to compose meaning from tiny fragments
- Sequences are long → more compute per document

**The sweet spot** for modern LLMs is typically 32K-100K tokens:

- GPT-2: 50,257 tokens
- GPT-4: ~100,000 tokens
- Llama 2: 32,000 tokens
- Claude: ~100,000 tokens

**Multilingual considerations:** Models serving many languages need larger vocabularies to represent diverse scripts efficiently. A vocabulary trained mostly on English will split Chinese or Arabic text into many tiny tokens, making those languages much more expensive to process.
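The embedding-table cost is simple arithmetic: one row per vocabulary entry, one column per hidden dimension. The sketch below works this out for a few configurations. The hidden sizes (768 for the smallest GPT-2, 4096 for a 7B-class Llama 2) are assumptions added here and are not taken from the lesson; `embedding_params` is a hypothetical helper.

```python
def embedding_params(vocab_size, hidden_dim):
    """Number of parameters in an embedding table of shape (vocab, hidden)."""
    return vocab_size * hidden_dim


# (vocabulary size, hidden dimension); hidden sizes are assumptions.
configs = {
    "GPT-2 (small)": (50_257, 768),
    "Llama 2 (7B)": (32_000, 4096),
    "100K vocabulary": (100_000, 4096),
}

for name, (vocab_size, hidden_dim) in configs.items():
    millions = embedding_params(vocab_size, hidden_dim) / 1e6
    print(f"{name}: {millions:.1f}M parameters")
# GPT-2 (small): 38.6M parameters
# Llama 2 (7B): 131.1M parameters
# 100K vocabulary: 409.6M parameters
```

At a fixed hidden size, the table grows linearly with the vocabulary, so tripling the vocabulary from 32K to 100K roughly triples the embedding parameters.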

Step 6: Special Tokens

Beyond regular text tokens, tokenizers include **special tokens** that serve structural roles. These tokens don't correspond to actual text; they're control signals that help the model understand the structure of its input.

**[CLS] (Classification):** Placed at the start of the input in BERT-style models. The model's representation of [CLS] is used as a summary of the entire sequence for classification tasks. GPT-style models don't use [CLS]; they typically use the last token's representation instead.

**[SEP] (Separator):** Marks boundaries between segments. In BERT, it separates sentence pairs: "[CLS] sentence A [SEP] sentence B [SEP]". Helps the model distinguish between different parts of the input.

**[PAD] (Padding):** Fills sequences to equal length within a batch. GPUs process batches of fixed-size tensors, so shorter sequences are padded. An attention mask marks the [PAD] positions so that the model ignores them.

**[MASK]:** Used in BERT-style masked language model training. Replaces a token that the model must predict: "The [MASK] sat down" → predict "cat". GPT-style models don't use [MASK] since they predict the next token, not masked tokens.

**<|endoftext|> / EOS:** Signals the end of a document or conversation turn. Critical for generation, because this is how the model learns when to stop producing text.

**<|im_start|>, <|im_end|>:** Chat-specific markers for system/user/assistant message boundaries in some conversational models.
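As a rough illustration of how [CLS], [SEP], and [PAD] fit together, the sketch below builds BERT-style sentence pairs and pads them into a batch with an attention mask. It works on token strings rather than IDs to stay readable; `build_pair` and `pad_batch` are hypothetical helpers, not functions from any library.

```python
def build_pair(tokens_a, tokens_b):
    """Lay out a sentence pair as [CLS] A [SEP] B [SEP]."""
    return ["[CLS]", *tokens_a, "[SEP]", *tokens_b, "[SEP]"]


def pad_batch(sequences, pad_token="[PAD]"):
    """Pad every sequence to the longest length and build attention masks.

    A mask value of 1 marks a real token, 0 marks padding to be ignored.
    """
    max_len = max(len(seq) for seq in sequences)
    padded, masks = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(seq + [pad_token] * n_pad)
        masks.append([1] * len(seq) + [0] * n_pad)
    return padded, masks


batch = [
    build_pair(["the", "cat", "sat"], ["it", "purred"]),
    build_pair(["hi"], ["hello"]),
]
padded, masks = pad_batch(batch)

for tokens, mask in zip(padded, masks):
    print(tokens)
    print(mask)
# ['[CLS]', 'the', 'cat', 'sat', '[SEP]', 'it', 'purred', '[SEP]']
# [1, 1, 1, 1, 1, 1, 1, 1]
# ['[CLS]', 'hi', '[SEP]', 'hello', '[SEP]', '[PAD]', '[PAD]', '[PAD]']
# [1, 1, 1, 1, 1, 0, 0, 0]
```

Both rows end up with the same length, which is what lets them be stacked into one fixed-size tensor. The mask is what keeps the padding from influencing the result.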

Step 7: Test Your Understanding

You've learned why text needs to be tokenized, the main tokenization strategies, how BPE works, and the role of special tokens. Let's test your understanding!

Prerequisites

  • What Is an LLM (recommended)

Key Concepts

  • Character Tokenization
  • Word Tokenization
  • Subword Tokenization
  • Byte Pair Encoding (BPE)
  • Token IDs & Embeddings
  • Vocabulary Size Tradeoffs
  • Special Tokens