Feed-Forward Networks — Lesson Content
Understand why attention needs a non-linear feed-forward network and how the expand-activate-compress pattern works.
Self-attention is powerful, but its output is a weighted average of Value vectors -- it mixes information across tokens rather than transforming it. The feed-forward network (FFN) adds the per-token non-linearity that makes Transformers expressive.
You'll learn the two-layer FFN architecture (Linear -> ReLU -> Linear), why it expands to 4x the model dimension before compressing back, how it processes each token independently, and why the FFN contains 2/3 of a Transformer block's parameters. This lesson builds directly on the attention mechanism to complete the Transformer block.
Learning Objectives
- Explain why attention alone is insufficient (linear limitation)
- Describe the FFN architecture: expand, activate, compress
- Understand why the FFN expands to 4x the model dimension
- Explain per-token (position-wise) processing and its benefits
- Analyze where parameters live in a Transformer block
Step 1: Why Attention Isn't Enough
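Self-attention is powerful, but it has a critical limitation: **once the attention weights are computed, the output is a linear function of the Value vectors**.
Attention computes a weighted sum of Value vectors:
**output_i = sum_j(weight_ij * V_j)**
No matter how sophisticated the attention weights are, the output is always a **linear combination** (a weighted average) of the Value vectors. The softmax makes the weights depend non-linearly on the input, but the values themselves are only ever averaged, never passed through a non-linearity of their own. Attention is therefore good at mixing information across tokens but limited in how it can transform it.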
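The weighted-sum formula can be checked numerically. The following is a minimal NumPy sketch, not part of the lesson's own code: the sizes, the seed, and the random matrices are placeholders, and the attention weights are held fixed (in real attention they come from a softmax over query-key scores).

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8  # illustrative sizes

# Fixed attention weights: each row is non-negative and sums to 1,
# as a softmax would produce. Random here purely for illustration.
scores = rng.normal(size=(seq_len, seq_len))
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

def mix(V):
    # output_i = sum_j(weight_ij * V_j)
    return weights @ V

V_a = rng.normal(size=(seq_len, d_model))
V_b = rng.normal(size=(seq_len, d_model))

# Linearity: mixing a combination equals the combination of the mixes.
lhs = mix(2.0 * V_a + 3.0 * V_b)
rhs = 2.0 * mix(V_a) + 3.0 * mix(V_b)
is_linear = np.allclose(lhs, rhs)
print(is_linear)
```

The check passes for any fixed weight matrix, which is the sense in which the mixing step is linear.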
Consider what this means in practice:
- Attention can say "blend 40% of 'cat' with 30% of 'sat' and 30% of 'down'"
- But it **cannot** compute "if 'cat' is a noun AND 'sat' is past tense, THEN mark this as a completed action"
That kind of conditional, non-linear reasoning requires a **feed-forward network** (FFN) after attention. Together, attention + FFN form a complete Transformer block:
- **Attention** gathers relevant context from all tokens
- **FFN** transforms each token's representation using that context
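A toy version of the "noun AND past tense" condition shows why a non-linearity is needed. The sketch below assumes two hypothetical binary features and compares the best affine fit (found by least squares) with a single ReLU unit; it illustrates the idea and is not a claim about how a trained Transformer encodes such rules.

```python
import numpy as np

# Binary features, e.g. a = "is a noun", b = "is past tense" (illustrative).
inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
target = np.array([0, 0, 0, 1], dtype=float)  # a AND b

# Best possible affine fit w_a*a + w_b*b + c, found by least squares.
design = np.hstack([inputs, np.ones((4, 1))])
coeffs, *_ = np.linalg.lstsq(design, target, rcond=None)
affine_error = float(np.abs(design @ coeffs - target).max())

# One ReLU unit computes AND exactly: max(0, a + b - 1).
relu_and = np.maximum(0, inputs.sum(axis=1) - 1)
relu_error = float(np.abs(relu_and - target).max())

print(f"affine max error: {affine_error:.2f}, ReLU max error: {relu_error:.2f}")
```

No choice of affine coefficients reproduces the AND target on all four inputs, while one ReLU unit matches it exactly.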
Step 2: The FFN Architecture: Linear -> ReLU -> Linear
The feed-forward network in a Transformer is a simple two-layer neural network applied to each token position:
**FFN(x) = Linear_2(ReLU(Linear_1(x)))**
Or equivalently:
**FFN(x) = max(0, x * W_1 + b_1) * W_2 + b_2**
This is often called a **position-wise** feed-forward network because it processes each token independently -- the same FFN with the same weights is applied to every position.
The architecture has three key stages:
1. **Expand:** Project from model dimension (d_model) to a larger inner dimension (d_ff), typically 4x larger
2. **Activate:** Apply ReLU (or GELU in newer models) to introduce non-linearity
3. **Compress:** Project back from d_ff to d_model
The expansion to a higher dimension gives the network more "room" to learn complex transformations before compressing back.
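For the "Activate" stage, the two activations named above can be compared on a few scalar inputs. This is a small sketch: the GELU shown is the widely used tanh approximation, and the function names are illustrative rather than taken from any particular library.

```python
import math

def relu(z: float) -> float:
    # Zero for negative inputs, identity otherwise.
    return max(0.0, z)

def gelu_tanh(z: float) -> float:
    # Tanh approximation of GELU: smooth, and slightly negative
    # for small negative inputs instead of exactly zero.
    inner = math.sqrt(2.0 / math.pi) * (z + 0.044715 * z ** 3)
    return 0.5 * z * (1.0 + math.tanh(inner))

for z in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"z={z:+.1f}  relu={relu(z):.4f}  gelu={gelu_tanh(z):.4f}")
```

Both functions behave almost identically for large positive inputs; they differ mainly near zero, where GELU is smooth rather than kinked.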
FFN(x) = max(0, x * W_1 + b_1) * W_2 + b_2
Where:
x = input vector (d_model dimensions)
W_1 = first weight matrix (d_model x d_ff)
b_1 = first bias vector (d_ff dimensions)
W_2 = second weight matrix (d_ff x d_model)
b_2 = second bias vector (d_model dimensions)
d_ff = 4 * d_model (typically)
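import numpy as np  # assumption: inputs are NumPy arrays

def relu(z):
    # Element-wise max(0, z)
    return np.maximum(0, z)

def feed_forward(x, W1, b1, W2, b2):
    # x shape: (seq_len, d_model)
    hidden = x @ W1 + b1       # Expand:   (seq_len, d_ff)
    hidden = relu(hidden)      # Activate: (seq_len, d_ff)
    output = hidden @ W2 + b2  # Compress: (seq_len, d_model)
    return output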
Step 3: Expansion: 8 -> 32 Dimensions
The first linear layer **expands** each token from d_model=8 dimensions to d_ff=32 dimensions (4x expansion).
Why expand? Consider this analogy: if you need to sort a messy desk, it helps to spread everything out on a bigger table first. The higher-dimensional space gives the network more room to:
- **Disentangle features** that are tangled in the smaller space
- **Create specialized neurons** that fire for specific patterns
- **Learn complex combinations** of the input features
After the linear projection, **ReLU** (Rectified Linear Unit) sets all negative values to zero:
**ReLU(x) = max(0, x)**
This is crucial -- it creates **sparse activations**. In a typical FFN, roughly 50-90% of neurons output zero after ReLU. This sparsity means each token activates only a subset of neurons, effectively selecting which "knowledge" in the FFN is relevant.
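To get a feel for this sparsity, the sketch below measures the fraction of zeros after ReLU for randomly initialised weights. The sizes, the seed, the zero bias, and the zero-mean Gaussian initialisation are all assumptions for illustration. With such weights about half of the pre-activations are negative; a trained network may behave differently.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
n_tokens = 1000  # many tokens so the estimate is stable (assumption)

x = rng.normal(size=(n_tokens, d_model))
W1 = rng.normal(size=(d_model, d_ff))
b1 = np.zeros(d_ff)

pre_activation = x @ W1 + b1            # (n_tokens, d_ff)
hidden = np.maximum(0, pre_activation)  # ReLU

zero_fraction = float(np.mean(hidden == 0))
print(hidden.shape, f"fraction of zeros: {zero_fraction:.2f}")
```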
Some researchers think of FFN neurons as **key-value memories**: each neuron "fires" for specific input patterns and contributes specific information to the output.
Step 4: Compression: 32 -> 8 Dimensions
The second linear layer **compresses** back from 32 dimensions to the original 8 dimensions:
**output = hidden * W_2 + b_2**
This compression is not just dimensionality reduction -- it's a **learned synthesis**. The network must combine the activated features from the expanded space into a compact, useful representation.
Think of it like writing a summary: you have spread out all the information (expansion) and identified the key points (ReLU sparsity), and now you condense them into a concise representation (compression).
The output has the **same shape** as the input (4 tokens x 8 dimensions), which is essential because:
1. It allows **residual connections** (output + input must have matching dimensions)
2. The output feeds into the **next Transformer block** which expects d_model dimensions
3. Multiple blocks can be stacked without dimension mismatches
After compression, the residual connection adds the FFN output to its input:
**final = LayerNorm(x + FFN(x))**
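The residual-plus-normalisation step can be sketched as follows. This is a minimal illustration of the post-norm form written above (some models instead apply the normalisation before the sub-layer). The `layer_norm` helper is a simplified stand-in that omits the learned scale and shift, and all weights are random placeholders.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalise each token (row) to zero mean and unit variance.
    # Learned scale and shift parameters are omitted for brevity.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def ffn(x, W1, b1, W2, b2):
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 8, 32
x = rng.normal(size=(seq_len, d_model))
W1 = rng.normal(size=(d_model, d_ff)) * 0.1
W2 = rng.normal(size=(d_ff, d_model)) * 0.1
b1, b2 = np.zeros(d_ff), np.zeros(d_model)

# The residual add works only because FFN(x) has the same shape as x.
final = layer_norm(x + ffn(x, W1, b1, W2, b2))
print(final.shape)  # (4, 8)
```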
Step 5: Per-Token Processing: Same FFN, Applied Independently
A crucial property of the FFN is that it's **position-wise** -- the exact same weights are applied to every token independently. Token 0 ("The") and token 2 ("sat") pass through the **same** FFN with the **same** W_1, W_2, b_1, b_2.
This is fundamentally different from attention, which mixes information across tokens. The FFN processes each token in isolation:
**For each token i: output_i = FFN(input_i)**
No token can see any other token during FFN processing. All cross-token communication already happened in the preceding attention layer.
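Position-wise independence is easy to verify. The sketch below uses random placeholder weights and a compact `ffn` helper defined only for this example. It checks two things: applying the FFN to the whole sequence matches applying it token by token, and changing one token leaves the other outputs unchanged.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 8, 32
W1 = rng.normal(size=(d_model, d_ff))
W2 = rng.normal(size=(d_ff, d_model))
b1, b2 = rng.normal(size=d_ff), rng.normal(size=d_model)
x = rng.normal(size=(seq_len, d_model))

# Whole sequence in one call vs. one token at a time.
batched = ffn(x, W1, b1, W2, b2)
one_by_one = np.stack([ffn(x[i], W1, b1, W2, b2) for i in range(seq_len)])
same_result = np.allclose(batched, one_by_one)

# Changing token 2 leaves the outputs for tokens 0, 1 and 3 untouched.
x_changed = x.copy()
x_changed[2] = rng.normal(size=d_model)
others = [0, 1, 3]
others_unchanged = np.allclose(
    ffn(x_changed, W1, b1, W2, b2)[others], batched[others]
)
print(same_result, others_unchanged)
```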
Why is this a good design?
1. **Efficiency:** Processing each token independently is embarrassingly parallel
2. **Shared knowledge:** The same transformation is learned once and applied everywhere -- like a lookup table that maps patterns to meanings
3. **Complementary to attention:** Attention handles "which tokens are relevant," FFN handles "how to transform the gathered information"
Interpretability research suggests that FFN neurons often encode **factual and linguistic knowledge**. A neuron might activate for "is a capital city" or "is past tense" regardless of which position the token occupies.
Step 6: Parameter Count Analysis
Let's count the parameters in a Transformer block to understand where the model's capacity lives.
For our example with d_model=8 and d_ff=32:
**Attention parameters:**
- W_Q, W_K, W_V: 3 x (8 x 8) = 192
- W_O (output projection): 8 x 8 = 64
- Total attention: **256 parameters**
**FFN parameters:**
- W_1: 8 x 32 = 256
- b_1: 32
- W_2: 32 x 8 = 256
- b_2: 8
- Total FFN: **552 parameters**
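The FFN uses **more than twice** as many parameters as attention (552 vs. 256)! The reason is structural: attention has 4 x d_model^2 weights, while an FFN with d_ff = 4 x d_model has 2 x d_model x d_ff = 8 x d_model^2. The same ratio therefore holds at scale: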
In GPT-3 (d_model=12288, d_ff=49152):
- Attention per layer: ~604M parameters
- FFN per layer: ~1.21B parameters
- **FFN is about 2/3 of each layer's parameters**
This is why some researchers call FFNs the "memory" of the Transformer -- they hold most of each block's parameters and are thought to store much of the model's learned knowledge.
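The counts above can be reproduced with a few lines of arithmetic. The helper names below are made up for this sketch. It follows the lesson's accounting: attention biases and LayerNorm parameters are left out, and FFN biases are included.

```python
def attention_params(d_model: int) -> int:
    # W_Q, W_K, W_V and W_O, each d_model x d_model; biases omitted,
    # matching the count in the lesson.
    return 4 * d_model * d_model

def ffn_params(d_model: int, d_ff: int) -> int:
    # W_1 and W_2 plus their bias vectors b_1 and b_2.
    return d_model * d_ff + d_ff + d_ff * d_model + d_model

for name, d_model, d_ff in [("toy", 8, 32), ("GPT-3 sized", 12288, 49152)]:
    attn = attention_params(d_model)
    ffn = ffn_params(d_model, d_ff)
    share = ffn / (attn + ffn)
    print(f"{name}: attention={attn:,} ffn={ffn:,} ffn share={share:.3f}")
```

In the toy model the bias terms are large relative to the weights, so the FFN share sits slightly above two-thirds. At large d_model the biases become negligible and the share approaches 2/3.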
Step 7: Test Your Understanding
You've learned how feed-forward networks complement attention in the Transformer. Let's test your understanding of why FFNs are essential, how they work, and where the parameters live.
Prerequisites
- Self-attention mechanism
- Neural network basics (linear layers, activations)
- Matrix multiplication
Key Concepts
- Position-wise Feed-Forward Network
- ReLU / GELU Activation
- Dimension Expansion and Compression
- Sparse Activations
- Parameter Distribution