Self-attention is powerful, but it has a critical limitation: it's fundamentally a linear operation.

Attention computes a weighted sum of Value vectors:

output_i = sum_j(weight_ij * V_j)

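No matter how sophisticated the attention weights are, the output is always a linear combination of the Value vectors. The weights themselves come from a softmax, but once computed they only scale and add values. This means attention alone cannot apply a non-linear transformation to the content it gathers -- it can only mix information, not transform it.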

Consider what this means in practice:

• Attention can say "blend 40% of 'cat' with 30% of 'sat' and 30% of 'down'"
• But it cannot compute "if 'cat' is a noun AND 'sat' is past tense, THEN mark this as a completed action"

That kind of conditional, non-linear reasoning requires a feed-forward network (FFN) after attention. Together, attention + FFN form a complete Transformer block:

• Attention gathers relevant context from all tokens
• FFN transforms each token's representation using that context

Attention vs Feed-Forward Network

| Property | Self-Attention | Feed-Forward Network |
|---|---|---|
| Operation type | Linear (weighted sum) | Non-linear (ReLU activation) |
| Scope | Cross-token (mixes information between tokens) | Per-token (processes each token independently) |
| Purpose | Gather context: "what is relevant?" | Transform: "what to do with the context?" |
| Parameters | W_Q, W_K, W_V, W_O | W_1, b_1, W_2, b_2 |
| Analogy | Reading relevant paragraphs | Reasoning about what you read |

Why Linear Attention Needs Non-Linear FFN

| Limitation | Example | Why FFN Fixes It |
|---|---|---|
| No non-linearity | Cannot learn XOR-like patterns | ReLU activation enables non-linear decision boundaries |
| No per-token transformation | Cannot independently process each position | FFN applies same transformation to each token separately |
| Limited expressiveness | Linear combinations cannot approximate arbitrary functions | Two-layer FFN is a universal function approximator (given enough hidden units) |
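The claim that a ReLU layer makes XOR-like patterns reachable can be checked with a tiny hand-built example. The sketch below is illustrative only: the weights are picked by hand (not learned, and not from the lesson) so that the two-layer network computes XOR(x1, x2) = ReLU(x1 + x2) - 2 * ReLU(x1 + x2 - 1).

```python
import numpy as np

# All four binary input pairs, one per row.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

# Hand-picked (hypothetical) weights for a 2 -> 2 -> 1 network.
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])      # both hidden units see x1 + x2
b1 = np.array([0.0, -1.0])       # second unit only fires when x1 + x2 > 1
W2 = np.array([[1.0], [-2.0]])   # subtract twice the second unit
b2 = np.array([0.0])


def two_layer_relu(x):
    # Linear -> ReLU -> Linear, the same pattern as the Transformer FFN
    hidden = np.maximum(0, x @ W1 + b1)
    return hidden @ W2 + b2


xor_out = two_layer_relu(X).ravel()
print(xor_out)  # [0. 1. 1. 0.]
```

No single linear (or affine) layer can produce this output: any such map f satisfies f(0,0) + f(1,1) = f(0,1) + f(1,0), which XOR would turn into 0 + 0 = 1 + 1.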

Feed-Forward Networks — Lesson Content

Understand why attention needs a non-linear feed-forward network and how the expand-activate-compress pattern works.

Self-attention is powerful but fundamentally linear -- it can only mix information, not transform it. The feed-forward network (FFN) adds the non-linearity that makes Transformers expressive. You'll learn the two-layer FFN architecture (Linear -> ReLU -> Linear), why it expands to 4x the model dimension before compressing back, how it processes each token independently, and why the FFN contains 2/3 of a Transformer block's parameters. This lesson builds directly on the attention mechanism to complete the Transformer block.

Learning Objectives

Explain why attention alone is insufficient (linear limitation)
Describe the FFN architecture: expand, activate, compress
Understand why the FFN expands to 4x the model dimension
Explain per-token (position-wise) processing and its benefits
Analyze where parameters live in a Transformer block

Step 1: Why Attention Isn't Enough

Self-attention is powerful, but it has a critical limitation: **it's fundamentally a linear operation**.

Attention computes a weighted sum of Value vectors:

**output_i = sum_j(weight_ij * V_j)**

No matter how sophisticated the attention weights are, the output is always a **linear combination** of the Value vectors. The weights themselves come from a softmax, but once computed they only scale and add values. This means attention alone cannot apply a non-linear transformation to the content it gathers -- it can only mix information, not transform it.

Consider what this means in practice:

- Attention can say "blend 40% of 'cat' with 30% of 'sat' and 30% of 'down'"
- But it **cannot** compute "if 'cat' is a noun AND 'sat' is past tense, THEN mark this as a completed action"

That kind of conditional, non-linear reasoning requires a **feed-forward network** (FFN) after attention. Together, attention + FFN form a complete Transformer block:

- **Attention** gathers relevant context from all tokens
- **FFN** transforms each token's representation using that context
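To make the weighted-sum view concrete, here is a minimal NumPy sketch. The weights and Value vectors are made-up illustrative numbers, not taken from a trained model. It checks that, once the attention weights are fixed, scaling and adding Value vectors scales and adds the output in exactly the same way.

```python
import numpy as np

# Hypothetical attention weights for one query token over three tokens
# ("cat", "sat", "down"); the row sums to 1.
weights = np.array([[0.4, 0.3, 0.3]])

# Two hypothetical sets of Value vectors, one row per token (4 dims each).
V_a = np.array([[1.0, 0.0, 2.0, -1.0],
                [0.5, 1.0, 0.0, 3.0],
                [-2.0, 1.0, 1.0, 0.0]])
V_b = np.array([[0.0, 1.0, 1.0, 1.0],
                [2.0, -1.0, 0.5, 0.0],
                [1.0, 1.0, -3.0, 2.0]])


def attend(weights, V):
    # output_i = sum_j(weight_ij * V_j), written as a matrix product
    return weights @ V


# Linearity in V for fixed weights:
# attend(2*V_a + 3*V_b) should equal 2*attend(V_a) + 3*attend(V_b).
lhs = attend(weights, 2.0 * V_a + 3.0 * V_b)
rhs = 2.0 * attend(weights, V_a) + 3.0 * attend(weights, V_b)
is_linear = np.allclose(lhs, rhs)
print(is_linear)
```

The check only covers the value pathway with the weights held fixed, which is the sense of "linear" used in this step.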

Step 2: The FFN Architecture: Linear -> ReLU -> Linear

The feed-forward network in a Transformer is a simple two-layer neural network applied to each token position:

**FFN(x) = Linear_2(ReLU(Linear_1(x)))**

Or equivalently:

**FFN(x) = max(0, x * W_1 + b_1) * W_2 + b_2**

This is often called a **position-wise** feed-forward network because it processes each token independently -- the same FFN with the same weights is applied to every position.

The architecture has three key stages:

1. **Expand:** Project from model dimension (d_model) to a larger inner dimension (d_ff), typically 4x larger
2. **Activate:** Apply ReLU (or GELU in newer models) to introduce non-linearity
3. **Compress:** Project back from d_ff to d_model

The expansion to a higher dimension gives the network more "room" to learn complex transformations before compressing back.

FFN(x) = max(0, x * W_1 + b_1) * W_2 + b_2

Where:
  x      = input vector (d_model dimensions)
  W_1    = first weight matrix (d_model x d_ff)
  b_1    = first bias vector (d_ff dimensions)
  W_2    = second weight matrix (d_ff x d_model)
  b_2    = second bias vector (d_model dimensions)
  d_ff   = 4 * d_model (typically)

import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # x shape: (seq_len, d_model)
    hidden = x @ W1 + b1            # Expand:   (seq_len, d_ff)
    hidden = np.maximum(0, hidden)  # Activate: ReLU, shape (seq_len, d_ff)
    output = hidden @ W2 + b2       # Compress: (seq_len, d_model)
    return output
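As a quick usage check of the function above, the following sketch runs it on random inputs with the lesson's sizes (4 tokens, d_model=8, d_ff=32). The random weights and the 0.1 scaling are assumptions for illustration, and ReLU is written as `np.maximum(0, ...)`.

```python
import numpy as np


def feed_forward(x, W1, b1, W2, b2):
    # Linear -> ReLU -> Linear, applied to every row (token) of x
    hidden = np.maximum(0, x @ W1 + b1)
    return hidden @ W2 + b2


rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 8, 32

x = rng.normal(size=(seq_len, d_model))       # 4 token vectors
W1 = rng.normal(size=(d_model, d_ff)) * 0.1   # expand: 8 -> 32
b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)) * 0.1   # compress: 32 -> 8
b2 = np.zeros(d_model)

out = feed_forward(x, W1, b1, W2, b2)
print(x.shape, out.shape)  # (4, 8) (4, 8)
```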

Step 3: Expansion: 8 -> 32 Dimensions

The first linear layer **expands** each token from d_model=8 dimensions to d_ff=32 dimensions (4x expansion).

Why expand? Consider this analogy: if you need to sort a messy desk, it helps to spread everything out on a bigger table first. The higher-dimensional space gives the network more room to:

- **Disentangle features** that are tangled in the smaller space
- **Create specialized neurons** that fire for specific patterns
- **Learn complex combinations** of the input features

After the linear projection, **ReLU** (Rectified Linear Unit) sets all negative values to zero:

**ReLU(x) = max(0, x)**

This is crucial -- it creates **sparse activations**. In a typical FFN, roughly 50-90% of neurons output zero after ReLU. This sparsity means each token activates only a subset of neurons, effectively selecting which "knowledge" in the FFN is relevant.

Some researchers think of FFN neurons as **key-value memories**: each neuron "fires" for specific input patterns and contributes specific information to the output.
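The sparsity created by ReLU can be observed directly. The sketch below uses random, untrained weights with zero bias (an assumption for illustration), so about half of the pre-activations are expected to be negative. Trained models can land anywhere in the range quoted above.

```python
import numpy as np

rng = np.random.default_rng(42)
seq_len, d_model, d_ff = 4, 8, 32

x = rng.normal(size=(seq_len, d_model))   # 4 hypothetical token vectors
W1 = rng.normal(size=(d_model, d_ff))     # random, untrained expansion weights
b1 = np.zeros(d_ff)

pre_activation = x @ W1 + b1              # expand: (4, 32)
hidden = np.maximum(0, pre_activation)    # ReLU zeroes every negative entry

zero_fraction = np.mean(hidden == 0)            # share of silent neurons
active_per_token = (hidden > 0).sum(axis=1)     # how many neurons fire per token

print(f"fraction of zeros: {zero_fraction:.2f}")
print("active neurons per token:", active_per_token)
```

Each token typically ends up with a different set of active neurons, which is the "selection" effect described above.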

Step 4: Compression: 32 -> 8 Dimensions

The second linear layer **compresses** back from 32 dimensions to the original 8 dimensions:

**output = hidden * W_2 + b_2**

This compression is not just dimensionality reduction -- it's a **learned synthesis**. The network must combine the activated features from the expanded space into a compact, useful representation.

Think of it like writing a summary: you spread out all the information (expansion), identified the key points (ReLU sparsity), and now you compress it into a concise representation (compression).

The output has the **same shape** as the input (4 tokens x 8 dimensions), which is essential because:

1. It allows **residual connections** (output + input must have matching dimensions)
2. The output feeds into the **next Transformer block** which expects d_model dimensions
3. Multiple blocks can be stacked without dimension mismatches

After compression, the residual connection adds the FFN output to its input:

**final = LayerNorm(x + FFN(x))**
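The residual step can be sketched in a few lines. This is a simplified illustration under stated assumptions: the weights are random, and `layer_norm` here is a bare-bones version that omits the learned scale and shift a real LayerNorm usually carries.

```python
import numpy as np


def feed_forward(x, W1, b1, W2, b2):
    # Expand -> ReLU -> compress
    hidden = np.maximum(0, x @ W1 + b1)
    return hidden @ W2 + b2


def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance.
    # Simplification: no learned scale/shift parameters.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


rng = np.random.default_rng(1)
seq_len, d_model, d_ff = 4, 8, 32
x = rng.normal(size=(seq_len, d_model))
W1 = rng.normal(size=(d_model, d_ff)) * 0.1
b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)) * 0.1
b2 = np.zeros(d_model)

# The addition only works because FFN(x) has the same shape as x.
final = layer_norm(x + feed_forward(x, W1, b1, W2, b2))
print(final.shape)  # (4, 8)
```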

Step 5: Per-Token Processing: Same FFN, Applied Independently

A crucial property of the FFN is that it's **position-wise** -- the exact same weights are applied to every token independently. Token 0 ("The") and token 2 ("sat") pass through the **same** FFN with the **same** W_1, W_2, b_1, b_2.

This is fundamentally different from attention, which mixes information across tokens. The FFN processes each token in isolation:

**For each token i: output_i = FFN(input_i)**

No token can see any other token during FFN processing. All cross-token communication already happened in the preceding attention layer.

Why is this a good design?

1. **Efficiency:** Processing each token independently is embarrassingly parallel
2. **Shared knowledge:** The same transformation is learned once and applied everywhere -- like a lookup table that maps patterns to meanings
3. **Complementary to attention:** Attention handles "which tokens are relevant," FFN handles "how to transform the gathered information"

Research has shown that FFN neurons often encode **factual knowledge**. A neuron might activate for "is a capital city" or "is past tense" regardless of which position the token occupies.
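The position-wise property can be checked numerically. The sketch below, using random illustrative weights (an assumption), compares running the FFN on the whole sequence with running it one token at a time, and then perturbs a single token to see which outputs move.

```python
import numpy as np


def feed_forward(x, W1, b1, W2, b2):
    # Works for a full (seq_len, d_model) matrix or a single (d_model,) vector
    hidden = np.maximum(0, x @ W1 + b1)
    return hidden @ W2 + b2


rng = np.random.default_rng(7)
seq_len, d_model, d_ff = 4, 8, 32
x = rng.normal(size=(seq_len, d_model))
W1 = rng.normal(size=(d_model, d_ff))
b1 = rng.normal(size=d_ff)
W2 = rng.normal(size=(d_ff, d_model))
b2 = rng.normal(size=d_model)

# 1) Whole sequence at once vs. one token at a time: same result.
full = feed_forward(x, W1, b1, W2, b2)
per_token = np.stack([feed_forward(x[i], W1, b1, W2, b2) for i in range(seq_len)])
same_result = np.allclose(full, per_token)

# 2) Change only token 2; only output row 2 should change.
x_changed = x.copy()
x_changed[2] += 1.0
changed = feed_forward(x_changed, W1, b1, W2, b2)
rows_that_changed = np.where(~np.isclose(full, changed).all(axis=1))[0]

print(same_result, rows_that_changed)
```

The batched call is what makes the FFN cheap to parallelize: every row goes through the same two matrix multiplications.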

Step 6: Parameter Count Analysis

Let's count the parameters in a Transformer block to understand where the model's capacity lives. For our example with d_model=8 and d_ff=32:

**Attention parameters** (weight matrices only, biases not counted):

- W_Q, W_K, W_V: 3 x (8 x 8) = 192
- W_O (output projection): 8 x 8 = 64
- Total attention: **256 parameters**

**FFN parameters:**

- W_1: 8 x 32 = 256
- b_1: 32
- W_2: 32 x 8 = 256
- b_2: 8
- Total FFN: **552 parameters**

The FFN uses **more than twice** as many parameters as attention! Roughly the same 2:1 ratio holds at scale. In GPT-3 (d_model=12288, d_ff=49152):

- Attention per layer: ~604M parameters
- FFN per layer: ~1.21B parameters
- **FFN is about 2/3 of each layer's parameters**

This is why some researchers call FFNs the "memory" of the Transformer: they hold most of each block's parameters and are thought to store much of the model's learned knowledge.
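The arithmetic above can be reproduced with two small helper functions. The function names are hypothetical, and the counting convention follows the text: attention counts only the four weight matrices, while the FFN includes its biases.

```python
def attention_params(d_model):
    # W_Q, W_K, W_V and W_O, each d_model x d_model (biases not counted)
    return 4 * d_model * d_model


def ffn_params(d_model, d_ff):
    # W_1 (d_model x d_ff) + b_1 (d_ff) + W_2 (d_ff x d_model) + b_2 (d_model)
    return d_model * d_ff + d_ff + d_ff * d_model + d_model


for name, d_model, d_ff in [("toy", 8, 32), ("GPT-3-sized", 12288, 49152)]:
    attn = attention_params(d_model)
    ffn = ffn_params(d_model, d_ff)
    share = ffn / (attn + ffn)
    print(f"{name}: attention={attn:,} ffn={ffn:,} ffn_share={share:.3f}")
```

For the toy sizes this gives 256 and 552 parameters. For the GPT-3-sized layer it gives about 604M and 1.21B, an FFN share of roughly two thirds.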

Step 7: Test Your Understanding

You've learned how feed-forward networks complement attention in the Transformer. Let's test your understanding of why FFNs are essential, how they work, and where the parameters live.

Prerequisites

Self-attention mechanism
Neural network basics (linear layers, activations)
Matrix multiplication

Key Concepts

Position-wise Feed-Forward Network
ReLU / GELU Activation
Dimension Expansion and Compression
Sparse Activations
Parameter Distribution
