Cross-Attention & Encoder-Decoder — Lesson Content
Learn how cross-attention connects encoders and decoders for translation, summarization, and sequence-to-sequence tasks.
Cross-attention is the bridge between the encoder and decoder in the original Transformer. While self-attention lets tokens attend to their own sequence, cross-attention lets the decoder attend to the encoder's output -- enabling the model to "look back" at the source while generating the target.
You'll trace cross-attention through a translation example ("The cat sat" to "Le chat assis"), see how word alignment emerges automatically, learn about T5's text-to-text paradigm, compare the three Transformer architectures (encoder-only, decoder-only, encoder-decoder), and explore modern trends including multimodal models and mixture of experts.
Learning Objectives
- Explain how cross-attention differs from self-attention (Q from decoder, K/V from encoder)
- Trace cross-attention through a translation example and interpret the alignment
- Describe T5's text-to-text paradigm and its advantages
- Compare encoder-only, decoder-only, and encoder-decoder architectures
- Identify modern applications of cross-attention beyond text translation
Step 1: The Encoder-Decoder Architecture
The original Transformer (2017) used **both** an encoder and a decoder working together. This is the architecture designed for **sequence-to-sequence** tasks: take an input sequence and produce a different output sequence.
**Encoder:** Reads the input with bidirectional attention. Produces rich representations of the source text. ("Understand the input.")
**Decoder:** Generates the output autoregressively. Uses causal self-attention within the output sequence. ("Produce the output.")
**Cross-attention:** The bridge between encoder and decoder. The decoder attends to the encoder's output to access source information. ("What in the input is relevant to what I'm generating now?")
Each decoder block has **three** sublayers:
1. **Causal self-attention:** Attend to previously generated output tokens
2. **Cross-attention:** Attend to the encoder's output (the source sequence)
3. **Feed-forward network:** Transform each token independently
This architecture is ideal for tasks where the input and output are different sequences: translation ("The cat sat" -> "Le chat assis"), summarization, or question answering with generation.
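The three decoder sublayers listed above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the original implementation: it omits layer normalization, multi-head splitting, and dropout, and the names `attention`, `decoder_block`, and `params` are hypothetical helpers introduced here.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # subtract row max for stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(queries_from, keys_values_from, W_Q, W_K, W_V, causal=False):
    Q = queries_from @ W_Q
    K = keys_values_from @ W_K
    V = keys_values_from @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    if causal:
        # Block attention to future positions (strictly upper triangle).
        future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ V

def decoder_block(x, encoder_output, p):
    # 1. Causal self-attention over the tokens generated so far.
    x = x + attention(x, x, *p["self"], causal=True)
    # 2. Cross-attention: queries from the decoder, keys/values from the encoder.
    x = x + attention(x, encoder_output, *p["cross"])
    # 3. Position-wise feed-forward network (ReLU), applied to each token independently.
    W1, W2 = p["ffn"]
    return x + np.maximum(x @ W1, 0.0) @ W2

rng = np.random.default_rng(0)
d = 8  # assumed toy model width
params = {
    "self": [rng.normal(size=(d, d)) * 0.1 for _ in range(3)],
    "cross": [rng.normal(size=(d, d)) * 0.1 for _ in range(3)],
    "ffn": [rng.normal(size=(d, 4 * d)) * 0.1, rng.normal(size=(4 * d, d)) * 0.1],
}
encoder_output = rng.normal(size=(3, d))  # 3 source tokens
decoder_input = rng.normal(size=(2, d))   # 2 target tokens generated so far
out = decoder_block(decoder_input, encoder_output, params)
print(out.shape)  # (2, 8)
```

Because of the causal mask, changing a later target token leaves the output for earlier target positions unchanged, while every target position can still see all source tokens through cross-attention.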
Step 2: How Cross-Attention Works: Q from Decoder, K/V from Encoder
Cross-attention is structurally the same as self-attention, with one crucial difference: the **Query** comes from the decoder, but the **Key** and **Value** come from the encoder.
**Self-attention:** Q, K, V all come from the same sequence
**Cross-attention:** Q from decoder tokens, K and V from encoder output
The formula is identical:
**CrossAttention(Q_dec, K_enc, V_enc) = softmax(Q_dec * K_enc^T / sqrt(d_k)) * V_enc**
What this means intuitively:
- **Q (from decoder):** "I'm generating the French word 'chat'. What English words should I look at?"
- **K (from encoder):** "Here's what each English word represents."
- **V (from encoder):** "Here's the information each English word carries."
The decoder token "chat" generates a query, compares it against all encoder keys, and takes a weighted combination of encoder values. The high attention between "chat" and "cat" means the decoder is using the encoder's representation of "cat" to help generate "chat."
Note: The encoder's output is computed **once** and reused across all decoder time steps. This is efficient -- the encoder doesn't re-run for each generated token.
Cross-Attention(Q_dec, K_enc, V_enc):
    Q = decoder_hidden @ W_Q    (queries from decoder)
    K = encoder_output @ W_K    (keys from encoder)
    V = encoder_output @ W_V    (values from encoder)
    scores  = Q @ K^T / sqrt(d_k)
    weights = softmax(scores)
    output  = weights @ V

Key difference from self-attention:
    Self:  Q, K, V all from same source
    Cross: Q from decoder, K/V from encoder
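import numpy as np

def cross_attention(decoder_hidden, encoder_output, W_Q, W_K, W_V):
    """Single-head cross-attention: queries from the decoder, keys/values from the encoder."""
    Q = decoder_hidden @ W_Q    # (target_len, d_k), from decoder
    K = encoder_output @ W_K    # (source_len, d_k), from encoder
    V = encoder_output @ W_V    # (source_len, d_v), from encoder
    d_k = K.shape[-1]           # key dimension used for scaling
    scores = Q @ K.T / np.sqrt(d_k)                      # (target_len, source_len)
    scores = scores - scores.max(axis=-1, keepdims=True) # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over source positions
    output = weights @ V        # (target_len, d_v)
    return output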
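The note about computing the encoder output once can be made concrete. The sketch below is a hedged illustration with random toy weights: it projects the encoder output to K and V a single time, then reuses those cached matrices for every decoding step. The helper `cross_attend_step` and all sizes are assumptions made for this example.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_model, d_k = 8, 4  # assumed toy sizes
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))

encoder_output = rng.normal(size=(3, d_model))  # 3 source tokens, e.g. "The cat sat"

# Computed once, before decoding starts, then reused at every step.
K_cached = encoder_output @ W_K
V_cached = encoder_output @ W_V

def cross_attend_step(decoder_state):
    """Cross-attention for one decoder position using the cached K and V."""
    q = decoder_state @ W_Q                           # (d_k,)
    weights = softmax(q @ K_cached.T / np.sqrt(d_k))  # (source_len,)
    return weights @ V_cached, weights

# Hypothetical decoder hidden states for three decoding steps.
decoder_states = rng.normal(size=(3, d_model))
step_weights = []
for state in decoder_states:
    context, weights = cross_attend_step(state)
    step_weights.append(weights)

print(np.stack(step_weights).shape)  # (3, 3): one row of source weights per step
```

Only the query changes from step to step, so the per-step results match what a single batched cross-attention call over all target positions would produce.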
Step 3: Translation: "The cat sat" -> "Le chat assis"
Let's trace cross-attention through a concrete translation example.
**Encoder** processes the English source: "The cat sat"
- Each English token gets a rich bidirectional representation
- The encoder output is a (3 x d_model) matrix
**Decoder** generates the French target: "Le chat assis"
- Step 1: Generate "Le" -- cross-attention looks at the encoder output
  - "Le" strongly attends to "The" (82%) -- article aligns with article
- Step 2: Generate "chat" -- cross-attention again queries the encoder
  - "chat" strongly attends to "cat" (85%) -- noun aligns with its translation
- Step 3: Generate "assis" -- same mechanism
  - "assis" strongly attends to "sat" (80%) -- verb aligns with its translation
Arranged as a matrix, the cross-attention weights show this alignment pattern: each row is a French (target) token and each column is an English (source) token. The bright diagonal shows that the model has learned word-level alignment as an emergent property -- it was never explicitly told which words correspond!
This automatic alignment discovery is one of the most elegant properties of cross-attention.
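Reading an alignment off a cross-attention matrix amounts to taking the largest weight in each row. In the sketch below, the diagonal values are the percentages from the walkthrough; the off-diagonal values are invented here purely so that each row sums to 1, and are not outputs of any real model.

```python
import numpy as np

source = ["The", "cat", "sat"]
target = ["Le", "chat", "assis"]

# Rows: target (French) tokens. Columns: source (English) tokens.
# Diagonal entries come from the walkthrough; off-diagonal entries are
# made-up filler values chosen so that every row sums to 1.
weights = np.array([
    [0.82, 0.10, 0.08],
    [0.09, 0.85, 0.06],
    [0.07, 0.13, 0.80],
])

# For each target token, pick the source token with the highest weight.
alignment = [(t, source[j]) for t, j in zip(target, weights.argmax(axis=1))]
print(alignment)  # [('Le', 'The'), ('chat', 'cat'), ('assis', 'sat')]
```

In real models the pattern is rarely this clean: word order differs between languages and one target word may draw on several source words, so the bright cells do not always lie on the diagonal.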
Step 4: The T5 Model: Text-to-Text
**T5** (Text-to-Text Transfer Transformer, Google 2019) is one of the most influential encoder-decoder models. Its key insight: **frame every NLP task as text-to-text**.
Instead of task-specific heads (like BERT's classifier on [CLS]), T5 always generates text as output:
**Translation:** "translate English to French: The cat sat" -> "Le chat assis"
**Summarization:** "summarize: Long article text..." -> "Brief summary"
**Classification:** "classify sentiment: Great movie!" -> "positive"
**Question Answering:** "question: Where? context: The cat sat on the mat" -> "on the mat"
This unified format means the same model architecture, loss function, and training pipeline handles every task. The task is specified by a text **prefix** in the input.
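The prefix idea is simple enough to express as a small formatting function. The sketch below is illustrative only: `to_text_to_text` and `PREFIXES` are hypothetical names, and the prefix strings are copied from the examples in this lesson. A real pre-trained checkpoint expects the exact prefixes it was trained with.

```python
# Prefix strings taken from the examples in this lesson (illustrative only).
PREFIXES = {
    "translate_en_fr": "translate English to French: ",
    "summarize": "summarize: ",
    "sentiment": "classify sentiment: ",
}

def to_text_to_text(task, text, context=None):
    """Turn a (task, input) pair into a single input string for a text-to-text model."""
    if task == "qa":
        # Question answering packs two fields into one string.
        return f"question: {text} context: {context}"
    return PREFIXES[task] + text

examples = [
    to_text_to_text("translate_en_fr", "The cat sat"),
    to_text_to_text("sentiment", "Great movie!"),
    to_text_to_text("qa", "Where?", context="The cat sat on the mat"),
]
for line in examples:
    print(line)
```

Every task ends up as a plain string in and a plain string out, which is what lets one model, one loss, and one decoding loop serve all of them.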
T5 was pre-trained on the **C4 dataset** (Colossal Clean Crawled Corpus), roughly 750 GB of cleaned web text. The T5 paper systematically compared architectural and training choices and found:
- Encoder-decoder slightly outperforms decoder-only for many tasks
- Larger models consistently perform better
- More pre-training data helps
Step 5: Comparing Architectures
The Transformer family has three main architectures. Each has strengths suited to different tasks:
**Encoder-only (BERT, RoBERTa):**
- Bidirectional attention, no generation
- Best for: classification, NER, semantic similarity, retrieval
- Pre-training: Masked Language Modeling
- Produces rich representations but cannot generate text natively
**Decoder-only (GPT, LLaMA, Claude):**
- Causal attention, autoregressive generation
- Best for: text generation, chat, code, reasoning
- Pre-training: Next Token Prediction
- Currently dominant due to simplicity and scaling properties
**Encoder-Decoder (T5, BART, mBART):**
- Bidirectional encoder + causal decoder + cross-attention
- Best for: translation, summarization, seq-to-seq tasks
- Pre-training: Various (span corruption, denoising)
- Most faithful to the original Transformer design
A surprising finding: at sufficient scale, **decoder-only models can match encoder-decoder models** on many seq-to-seq tasks, simply by concatenating input and output in a single sequence. This is one major reason the industry has converged on decoder-only architectures for general-purpose models.
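One way to see why concatenation works is to look at the attention mask. In the sketch below (toy lengths, assumed for illustration), a single causal mask over the concatenated sequence already lets every target position see every source position, which is the role cross-attention plays in an encoder-decoder.

```python
import numpy as np

source_len, target_len = 3, 3  # e.g. "The cat sat" followed by "Le chat assis"
total = source_len + target_len

# Decoder-only: one causal mask over [source tokens ; target tokens].
# mask[i, j] is True when position i may attend to position j.
causal_mask = np.tril(np.ones((total, total), dtype=bool))

# Target rows, source columns: the region that stands in for cross-attention.
target_to_source = causal_mask[source_len:, :source_len]
# Source rows, source columns: the region that stands in for the encoder.
source_to_source = causal_mask[:source_len, :source_len]

print(target_to_source.all())  # True: every target token sees the whole source
print(source_to_source.all())  # False: source tokens only see earlier source tokens
```

The second result shows what is given up: under a plain causal mask the source is read left to right rather than bidirectionally, as it would be in a dedicated encoder.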
Step 6: Modern Trends & Applications
The Transformer has expanded far beyond text NLP. Here's where the field is heading:
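**Decoder-only dominance:** The GPT series and LLaMA are decoder-only, and GPT-4, Claude, and Gemini are generally understood to be as well, although their full architectures are not public. At sufficient scale, these models can handle translation and summarization (traditionally encoder-decoder tasks) by treating everything as text generation.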
**Multimodal Transformers:** Models now process multiple modalities:
- **GPT-4V / Gemini:** Text + images
- **Whisper:** Audio -> text (encoder-decoder with cross-attention)
- **DALL-E / Stable Diffusion:** Text -> images (Stable Diffusion, for example, conditions its image-denoising network on text-encoder output through cross-attention)
- **Video models:** Text + video frames
**Mixture of Experts (MoE):** Instead of one large FFN, use many smaller "expert" FFNs with a router that selects which experts to activate for each token. Mixtral uses this design, and GPT-4 is widely reported (though not officially confirmed) to use it as well, to scale parameter count without proportionally scaling compute.
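A router of this kind can be sketched as follows. This is a toy top-k mixture-of-experts layer under stated assumptions: gate weights are a softmax over the chosen experts' router logits (one common choice), there is no load balancing, and `moe_layer` and all sizes are hypothetical.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def moe_layer(x, W_router, experts, k=2):
    """x: (tokens, d). experts: list of (W1, W2) FFN weights. Returns (output, chosen)."""
    logits = x @ W_router                         # (tokens, n_experts)
    chosen = np.argsort(logits, axis=-1)[:, -k:]  # top-k expert indices per token
    output = np.zeros_like(x)
    for t in range(x.shape[0]):
        gate = softmax(logits[t, chosen[t]])      # renormalize over the chosen experts
        for g, e in zip(gate, chosen[t]):
            W1, W2 = experts[e]
            # Only the selected experts run for this token.
            output[t] += g * (np.maximum(x[t] @ W1, 0.0) @ W2)
    return output, chosen

rng = np.random.default_rng(0)
d, n_experts = 8, 4
experts = [
    (rng.normal(size=(d, 16)) * 0.1, rng.normal(size=(16, d)) * 0.1)
    for _ in range(n_experts)
]
W_router = rng.normal(size=(d, n_experts))
x = rng.normal(size=(5, d))

y, chosen = moe_layer(x, W_router, experts, k=2)
print(y.shape, chosen.shape)  # (5, 8) (5, 2)
```

With 4 experts and k=2, each token touches only half of the expert parameters, which is how total parameter count can grow faster than per-token compute.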
**Long-context models:** Extending context windows from the 512 to 2,048 tokens of early models (BERT used 512, GPT-3 used 2,048) to 128K-1M tokens through techniques like rotary position embeddings, FlashAttention, and ring attention.
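**Cross-attention lives on:** Even in a decoder-only world, cross-attention or a close analogue appears in:
- Multimodal models (text tokens attend to image features, as in Flamingo-style architectures)
- Retrieval-augmented generation (some designs, such as RETRO, attend to retrieved documents through cross-attention; others simply place them in the context)
- Tool-use models (tool outputs are typically appended to the context, where self-attention plays the same "look back at the source" role)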
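For the multimodal case, the mechanism is the same as in translation; only the source of K and V changes. The sketch below uses random stand-ins for text hidden states and image patch features, with made-up sizes, to show that the two sides may have different widths as long as the projections map them into a shared key dimension.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_text, d_image, d_k = 8, 12, 4  # hypothetical sizes

text_hidden = rng.normal(size=(5, d_text))       # 5 text tokens (queries)
image_features = rng.normal(size=(16, d_image))  # 16 image patches (keys/values)

W_Q = rng.normal(size=(d_text, d_k))
W_K = rng.normal(size=(d_image, d_k))     # maps image features into the shared key space
W_V = rng.normal(size=(d_image, d_text))  # maps image features to the text width

Q = text_hidden @ W_Q
K = image_features @ W_K
V = image_features @ W_V

weights = softmax(Q @ K.T / np.sqrt(d_k))  # (5, 16): each text token over all patches
fused = text_hidden + weights @ V          # residual add keeps the text width

print(weights.shape, fused.shape)  # (5, 16) (5, 8)
```

Swapping `image_features` for audio frames or encoded retrieved passages leaves the code unchanged, which is why the same building block recurs across these applications.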
Step 7: Test Your Understanding
You've learned how cross-attention bridges encoders and decoders, enabling powerful sequence-to-sequence models. Let's test your understanding.
Prerequisites
- Self-attention mechanism
- Encoder architecture (BERT lesson)
- Decoder architecture (GPT lesson)
Key Concepts
- Cross-Attention
- Encoder-Decoder Architecture
- Word Alignment
- Text-to-Text (T5)
- Architecture Comparison
- Multimodal Transformers