Positional Encoding

Injecting Order Into Parallel Processing

Difficulty

Intermediate

Duration

10-12 min

Prerequisites

Self-Attention

Step 1 of 7

Why Word Order Matters

Self-attention has a remarkable property: it is permutation equivariant (often loosely described as permutation invariant). If you scramble the order of tokens in a sequence, the attention mechanism produces the same output vectors, just reordered to match. It treats the input as a set, not a sequence.

But word order is crucial for meaning! Consider these scrambled versions of our sentence:

| Original | Scrambled | Same meaning? |
|---|---|---|
| "The cat sat down" | "down sat cat The" | No! |
| "Dog bites man" | "Man bites dog" | Opposite meaning! |
| "I saw her duck" | "Duck her saw I" | Nonsensical |

Without positional information, self-attention cannot distinguish "The cat sat down" from "down sat cat The": each token would receive the same attention weights over the same tokens regardless of order.

The solution: explicitly inject position information into the token representations. Before the first attention layer, we add a positional encoding to each token's embedding. This breaks the permutation symmetry and lets the model know where each token sits in the sequence.

The original Transformer paper (Vaswani et al., 2017) used sinusoidal positional encodings — elegant mathematical functions that encode position without any learned parameters.
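To make the scrambling argument concrete, here is a minimal sketch in plain Python. It assumes a single attention head with identity Q/K/V projections and made-up 2-dimensional embeddings (both are simplifications for illustration, not how a trained model is configured). It checks that reversing the sentence leaves every token's output vector unchanged; only the row order differs.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Single-head self-attention with identity Q/K/V projections (a simplification)."""
    d = len(x[0])
    out = []
    for q in x:
        # Scaled dot-product score of this query against every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        # Weighted sum of the value vectors (here the values are the inputs themselves).
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out

# Hypothetical embeddings for "The cat sat down" (values are made up).
tokens = ["The", "cat", "sat", "down"]
emb = {"The": [0.1, 0.9], "cat": [1.0, 0.2], "sat": [0.4, 0.7], "down": [0.8, 0.5]}

original = self_attention([emb[t] for t in tokens])
reversed_tokens = tokens[::-1]
scrambled = self_attention([emb[t] for t in reversed_tokens])

# Each token gets the same output vector in both orders; only its row index changes.
for i, tok in enumerate(tokens):
    j = reversed_tokens.index(tok)
    assert all(math.isclose(a, b) for a, b in zip(original[i], scrambled[j]))
print("Outputs match token by token, whatever the order.")
```

Adding a position-dependent vector to each embedding before calling `self_attention` would make the two runs differ, which is exactly what a positional encoding is for.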

The Scrambled Sentence Problem

| Sentence | Order | Meaning | Without Position Info |
|---|---|---|---|
| "The cat sat down" | Original | A cat sits down | Same attention as any permutation |
| "down sat cat The" | Reversed | Nonsensical | Identical attention scores! |
| "cat The down sat" | Shuffled | Nonsensical | Identical attention scores! |
| "Dog bites man" | Original | Dog attacks man | Same as "Man bites dog" |
| "Man bites dog" | Swapped | Man attacks dog | Same as "Dog bites man" |

Approaches to Encoding Position

| Approach | How Position Is Encoded | Used By |
|---|---|---|
| No encoding | Model is blind to order | Bag-of-words models |
| Sinusoidal | Fixed mathematical functions of position | Original Transformer (2017) |
| Learned | Trainable embedding per position | BERT, GPT-2 |
| Relative (RoPE) | Encode distance between tokens, not absolute position | LLaMA, GPT-NeoX |
| ALiBi | Bias attention scores by distance | BLOOM, MPT |

Positional Encoding — Lesson Content

Discover why transformers need position information and how sinusoidal, learned, and relative encodings inject word order into the model.

Self-attention has no built-in notion of order: without explicit position information, "The cat sat down" and "down sat cat The" would produce the same per-token outputs, just in a different order. Positional encodings solve this by injecting order information. You'll explore sinusoidal encodings with their elegant multi-frequency design, compare them to learned embeddings, and understand modern relative position methods like RoPE and ALiBi that power today's large language models.

Learning Objectives

Explain why position information is needed in transformers
Compute sinusoidal positional encodings from the formula
Understand the role of different frequencies across dimensions
Compare sinusoidal, learned, and relative position encodings
Describe modern approaches like RoPE and ALiBi

Step 2: The Sinusoidal Encoding Formula

The original Transformer uses sine and cosine functions at different frequencies to encode position:

PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

Where:

- **pos** = position in the sequence (0, 1, 2, ...)
- **i** = index of the sine/cosine dimension pair (0, 1, 2, ..., d_model/2 - 1)
- **d_model** = embedding dimension (8 in our example)

Even dimensions (0, 2, 4, 6) use **sine**, odd dimensions (1, 3, 5, 7) use **cosine**.

For d_model = 8, i ∈ {0, 1, 2, 3}:
  dim 0: sin(pos / 10000^(0/8)) = sin(pos)
  dim 1: cos(pos / 10000^(0/8)) = cos(pos)
  dim 2: sin(pos / 10000^(2/8)) = sin(pos / 10)
  dim 3: cos(pos / 10000^(2/8)) = cos(pos / 10)
  ...

For our sentence with d_model = 8:

- Position 0 ("The"): PE = [sin(0), cos(0), sin(0), cos(0), ...] = [0, 1, 0, 1, ...]
- Position 1 ("cat"): PE = [sin(1), cos(1), sin(0.1), cos(0.1), ...]

**Why sin/cos?** These functions have a beautiful property: for any fixed offset k, the encoding for position pos+k can be expressed as a **linear function** of the encoding for position pos. Each (sin, cos) pair is simply rotated by an angle that depends only on k. This means the model can easily learn to attend to relative positions (e.g., "the word 3 positions back") using simple linear transformations.
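The formula can be sketched in a few lines of plain Python. The helper names `sinusoidal_pe` and `shift_pair` are hypothetical, chosen for this illustration. The second half checks numerically the property behind the choice of sin/cos: a fixed rotation per dimension pair maps the encoding of position 2 onto the encoding of position 5.

```python
import math

def sinusoidal_pe(pos, d_model):
    """Return the d_model-dimensional sinusoidal encoding of one position."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.append(math.sin(angle))  # even dimension 2i
        pe.append(math.cos(angle))  # odd dimension 2i+1
    return pe

def shift_pair(sin_val, cos_val, k, omega):
    """Map the (sin, cos) pair at some position to the pair k positions later.

    Uses the angle-addition identities, so it needs only k and the pair's
    frequency omega, never the position itself.
    """
    c, s = math.cos(omega * k), math.sin(omega * k)
    return (sin_val * c + cos_val * s, cos_val * c - sin_val * s)

d_model = 8
pe0 = sinusoidal_pe(0, d_model)
pe1 = sinusoidal_pe(1, d_model)
print([round(v, 4) for v in pe0])
print([round(v, 4) for v in pe1])

# Shifting position 2 by k = 3 with a position-independent linear map gives position 5.
pe2, pe5 = sinusoidal_pe(2, d_model), sinusoidal_pe(5, d_model)
shifted = []
for i in range(d_model // 2):
    omega = 1.0 / (10000 ** (2 * i / d_model))
    shifted.extend(shift_pair(pe2[2 * i], pe2[2 * i + 1], 3, omega))
assert all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(shifted, pe5))
```

The first printed row alternates 0 and 1, as in the Position 0 example; the second starts with sin(1) and cos(1).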

Step 3: Different Frequencies for Different Dimensions

The key insight of sinusoidal encodings is that different dimensions oscillate at **different frequencies**:

- **Low dimensions (0, 1):** High frequency. The values change rapidly with each position. These capture **fine-grained** position differences (adjacent tokens look very different).
- **High dimensions (6, 7):** Low frequency. The values change slowly across positions. These capture **coarse** position information (nearby tokens look similar, but tokens far apart look different).

This multi-frequency representation is analogous to how we write time: "2:30:45 PM" uses hours (low frequency), minutes (medium), and seconds (high frequency). Any specific time has a unique combination of values across all frequencies. Similarly, each position in the sequence gets a **unique combination of sin/cos values** across all dimensions.

The model can use different frequencies for different purposes:

- High-frequency dims: distinguish adjacent tokens (important for local syntax)
- Low-frequency dims: distinguish distant regions (important for document structure)

The chart below shows four encoding curves (two frequencies × sin/cos). Notice how dims 0-1 oscillate rapidly while dims 2-3 change more slowly.
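One way to put numbers on "fast" and "slow" is to compute how many positions each dimension pair needs to complete one full cycle. The sketch below derives that wavelength, 2π · 10000^(2i/d_model), from the encoding formula; the function name `wavelengths` is an assumption made for this example.

```python
import math

def wavelengths(d_model):
    """Wavelength, in positions, of each sin/cos pair: 2*pi * 10000^(2i/d_model)."""
    return [2 * math.pi * 10000 ** (2 * i / d_model) for i in range(d_model // 2)]

for i, w in enumerate(wavelengths(8)):
    print(f"dims {2 * i}-{2 * i + 1}: one full cycle every {w:.1f} positions")
```

For d_model = 8 each pair is ten times slower than the one before it: dims 0-1 repeat roughly every 6.3 positions, while dims 6-7 need several thousand positions to complete a single cycle.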

Step 4: Visualizing the Encoding Matrix

Let's look at the complete positional encoding matrix for 8 positions and 8 dimensions. Each row is one position's encoding, a unique 8-dimensional vector.

**Key observations:**

1. **Each row is unique.** No two positions have the same encoding. This is essential: the model needs to distinguish every position.
2. **Column patterns.** Each column (dimension) shows a sinusoidal wave at a different frequency. Left columns change fast, right columns change slowly.
3. **Nearby positions are similar.** Positions 2 and 3 have more similar encodings than positions 2 and 7. This gives the model a notion of distance.
4. **Addition to embeddings.** These encodings are **added** (element-wise) to the token embeddings: input = embedding + positional_encoding. This means position information is mixed directly into the representation.

The usual explanation for why addition works is that the embedding space has enough dimensions to carry both semantic and positional information without destructive interference. The model learns to "read" both signals from the combined vector.
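The matrix and the observations above can be reproduced with a short script. This is a sketch in plain Python; `pe_matrix`, `distance`, and the constant `embedding` vector are illustrative assumptions, not values from the lesson.

```python
import math

def pe_matrix(n_positions, d_model):
    """Build the sinusoidal encoding matrix: one row per position."""
    rows = []
    for pos in range(n_positions):
        row = []
        for i in range(d_model // 2):
            angle = pos / (10000 ** (2 * i / d_model))
            row += [math.sin(angle), math.cos(angle)]
        rows.append(row)
    return rows

def distance(a, b):
    """Euclidean distance between two encoding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

pe = pe_matrix(8, 8)
for pos, row in enumerate(pe):
    print(pos, " ".join(f"{v:+.2f}" for v in row))

# Observation 1: every row is distinct.
unique_rows = len({tuple(round(v, 6) for v in row) for row in pe})
print("unique rows:", unique_rows)

# Observation 3: position 3 is closer to position 2 than position 7 is.
d_near = distance(pe[2], pe[3])
d_far = distance(pe[2], pe[7])
print(f"dist(2, 3) = {d_near:.3f}, dist(2, 7) = {d_far:.3f}")

# Observation 4: the encoding is added element-wise to the token embedding.
embedding = [0.5] * 8  # hypothetical token embedding
model_input = [e + p for e, p in zip(embedding, pe[1])]
```

Reading the printed matrix column by column shows observation 2 as well: the leftmost columns swing across their full range within a few rows, while the rightmost columns barely move over 8 positions.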

Step 5: Learned vs Sinusoidal Encodings

The original Transformer used fixed sinusoidal encodings, but many later models use **learned positional embeddings** instead. Let's compare:

**Sinusoidal (Fixed)**

- Computed from a formula, with no parameters to train
- Can theoretically generalize to sequence lengths longer than those seen in training
- The sin/cos linear relationship means relative positions are easy to learn
- Used by: Original Transformer, some smaller models

**Learned**

- Each position gets its own trainable embedding vector (just like word embeddings)
- More flexible: can learn arbitrary position patterns
- Cannot generalize beyond the maximum trained sequence length
- Used by: BERT, GPT-2, GPT-3

**The surprising finding:** In practice, learned and sinusoidal encodings perform almost identically on standard benchmarks. The original paper reported nearly identical results for the two. The choice often depends on whether you need to generalize to longer sequences (sinusoidal) or want maximum flexibility (learned).

Modern architectures have moved beyond both approaches toward **relative position encodings** that encode the distance between tokens rather than their absolute positions. We'll cover these next.
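The difference in length handling is easy to see in a toy implementation. The sketch below is an illustration under stated assumptions: `LearnedPositionalEmbedding` is a hypothetical class holding a randomly initialised lookup table (a real model would train these vectors by gradient descent), and `max_len=4` is deliberately tiny.

```python
import math
import random

class LearnedPositionalEmbedding:
    """Toy lookup table: one vector per position, up to max_len positions."""

    def __init__(self, max_len, d_model, seed=0):
        rng = random.Random(seed)
        # Random initial values stand in for parameters that training would update.
        self.table = [[rng.gauss(0.0, 0.02) for _ in range(d_model)]
                      for _ in range(max_len)]

    def __call__(self, pos):
        if not 0 <= pos < len(self.table):
            raise IndexError(f"position {pos} is outside the table (max_len={len(self.table)})")
        return self.table[pos]

def sinusoidal_pe(pos, d_model):
    """Fixed encoding computed from the formula; defined for any position."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe += [math.sin(angle), math.cos(angle)]
    return pe

learned = LearnedPositionalEmbedding(max_len=4, d_model=8)

print(len(learned(3)))              # last position the table covers
print(len(sinusoidal_pe(1000, 8)))  # the formula has no upper limit

try:
    learned(4)  # one past the end of the table
except IndexError as err:
    print("learned table:", err)
```

The formula always returns a vector, but that alone does not guarantee a trained model behaves well at positions it never saw; it only means the encoding itself is defined there.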

Step 6: Relative Position Encodings

A fundamental limitation of both sinusoidal and learned encodings is that they encode **absolute position**: "I am at position 5." But language understanding often depends on **relative position**: "this word is 3 positions before that word."

**Relative position encodings** address this directly. Instead of adding position to the input, they modify the attention computation itself based on the distance between tokens.

**RoPE (Rotary Position Embedding)**, used by LLaMA, Mistral, and many modern LLMs:

- Rotates Q and K vectors by an angle proportional to their position
- The dot product Q·K naturally becomes a function of the **distance** between tokens
- Elegant: position information is baked into the attention score, not the input

**ALiBi (Attention with Linear Biases)**, used by BLOOM and MPT:

- Adds a simple bias to attention scores: bias = -m × |i - j|
- Closer tokens are penalized less, farther tokens are penalized more
- No additional parameters, just a distance penalty per head
- Different heads use different slopes m, so some heads focus locally and others globally

These approaches generally handle **length generalization** better than absolute encodings: models trained on shorter sequences can cope with longer sequences at inference time, although how far this extends depends on the method.
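Both ideas fit in a short sketch. The code below assumes the adjacent-pair rotation convention for RoPE (some implementations pair dimensions differently) and an arbitrary ALiBi slope of 0.5; the vectors `q` and `k` are made-up values, and the function names are chosen for this illustration.

```python
import math

def rope_rotate(vec, pos, base=10000):
    """Rotate each (even, odd) pair of vec by the angle pos / base^(2i/d)."""
    d = len(vec)
    out = []
    for i in range(d // 2):
        theta = pos / (base ** (2 * i / d))
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical query and key vectors.
q = [0.3, -0.7, 1.1, 0.4]
k = [0.9, 0.2, -0.5, 0.8]

# Same offset (3 positions) at two very different absolute positions.
score_a = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_b = dot(rope_rotate(q, 105), rope_rotate(k, 102))
assert math.isclose(score_a, score_b, abs_tol=1e-9)

def alibi_bias(seq_len, slope):
    """ALiBi bias matrix: entry [i][j] is -slope * |i - j|."""
    return [[slope * -abs(i - j) for j in range(seq_len)] for i in range(seq_len)]

bias = alibi_bias(4, slope=0.5)
for row in bias:
    print(row)
```

The RoPE check passes because rotating q and k by their own positions leaves only the difference of the two angles in the dot product. The ALiBi matrix is added to the raw attention scores before the softmax; in a causal model only the entries with j ≤ i are used.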

Step 7: Test Your Understanding

You've learned why position information is essential and how different encoding schemes inject it into the transformer. Let's test your understanding!

Prerequisites

Self-attention basics
Sine and cosine functions
Word embeddings

Key Concepts

Positional Encoding
Sinusoidal Functions
Multi-Frequency Representation
Learned Embeddings
Relative Position Encoding
RoPE & ALiBi