Cross-Attention & Encoder-Decoder
Bridging Understanding and Generation
The Encoder-Decoder Architecture
The original Transformer (Vaswani et al., 2017) used an encoder and a decoder working together. This architecture was designed for sequence-to-sequence tasks: take an input sequence and produce a different output sequence.
Encoder: Reads the input with bidirectional attention. Produces rich representations of the source text. ("Understand the input.")
Decoder: Generates the output autoregressively. Uses causal self-attention within the output sequence. ("Produce the output.")
Cross-attention: The bridge between encoder and decoder. The decoder attends to the encoder's output to access source information. ("What in the input is relevant to what I'm generating now?")
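The defining property of cross-attention is where Q, K, and V come from: queries are projected from the decoder's states, while keys and values are projected from the encoder's output. A minimal single-head NumPy sketch (the projection matrices here are just random placeholders for learned weights):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_output, Wq, Wk, Wv):
    """Queries from the decoder; keys/values from the encoder output."""
    Q = decoder_states @ Wq                    # (tgt_len, d)
    K = encoder_output @ Wk                    # (src_len, d)
    V = encoder_output @ Wv                    # (src_len, d)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])    # (tgt_len, src_len)
    weights = softmax(scores, axis=-1)         # each target row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
d = 8
enc = rng.standard_normal((5, d))   # 5 source tokens, already encoded
dec = rng.standard_normal((3, d))   # 3 target tokens generated so far
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out, w = cross_attention(dec, enc, Wq, Wk, Wv)
# out: (3, 8) -- one source-informed context vector per target token
# w:   (3, 5) -- how much each target position attends to each source token
```

The weight matrix has shape (target length, source length): row t answers exactly the question above, "what in the input is relevant to target position t?"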
Each decoder block has three sublayers:
- Causal self-attention: Attend to previously generated output tokens
- Cross-attention: Attend to the encoder's output (the source sequence)
- Feed-forward network: Transform each token independently
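The three sublayers compose as below. This is a bare single-head sketch with residual connections but without layer normalization, multiple heads, or learned attention projections, all of which a real decoder block adds:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q_in, kv_in, mask=None):
    # Single head, identity projections, for brevity.
    scores = q_in @ kv_in.T / np.sqrt(q_in.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # block disallowed positions
    return softmax(scores, axis=-1) @ kv_in

def decoder_block(tgt, enc_out, W1, b1, W2, b2):
    T = tgt.shape[0]
    causal = np.tril(np.ones((T, T), dtype=bool))  # position t sees <= t
    # 1) Causal self-attention over the generated prefix (+ residual)
    x = tgt + attention(tgt, tgt, mask=causal)
    # 2) Cross-attention: queries from x, keys/values from the encoder
    x = x + attention(x, enc_out)
    # 3) Position-wise feed-forward network (+ residual)
    x = x + np.maximum(x @ W1 + b1, 0.0) @ W2 + b2
    return x

rng = np.random.default_rng(1)
d, d_ff = 8, 16
enc_out = rng.standard_normal((5, d))   # encoded source (5 tokens)
tgt = rng.standard_normal((3, d))       # target prefix (3 tokens)
W1, W2 = rng.standard_normal((d, d_ff)), rng.standard_normal((d_ff, d))
b1, b2 = np.zeros(d_ff), np.zeros(d)
out = decoder_block(tgt, enc_out, W1, b1, W2, b2)
# out has the same shape as tgt: one updated vector per target position
```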
This architecture is ideal for tasks where the input and output are different sequences: translation ("The cat sat" -> "Le chat s'est assis"), summarization, or question answering with generation.
Components of the Encoder-Decoder Transformer
| Component | Attention Type | Query From | Key/Value From | Purpose |
|---|---|---|---|---|
| Encoder self-attention | Bidirectional | Source tokens | Source tokens | Understand the input |
| Decoder self-attention | Causal (masked) | Target tokens | Target tokens | Model output dependencies |
| Cross-attention | Full (no mask) | Target tokens | Source tokens (encoder output) | Connect output to input |
| Feed-forward (encoder) | N/A | N/A | N/A | Transform encoder representations |
| Feed-forward (decoder) | N/A | N/A | N/A | Transform decoder representations |
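The three attention types in the table differ only in the shape of the attention matrix and which entries are masked. A quick sketch of the masks (`True` = allowed to attend):

```python
import numpy as np

src_len, tgt_len = 4, 3

# Encoder self-attention: bidirectional -- every source token sees all sources
enc_mask = np.ones((src_len, src_len), dtype=bool)

# Decoder self-attention: causal -- target position t sees positions <= t only
dec_mask = np.tril(np.ones((tgt_len, tgt_len), dtype=bool))

# Cross-attention: full -- every target position sees every source position
cross_mask = np.ones((tgt_len, src_len), dtype=bool)
```

Note that cross-attention needs no causal mask: the entire source sequence is available before generation starts, so there is nothing "future" to hide.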
Sequence-to-Sequence Tasks
| Task | Input (Encoder) | Output (Decoder) | Cross-Attention Role |
|---|---|---|---|
| Translation | "The cat sat" | "Le chat assis" | Align source words to target words |
| Summarization | Long article | Short summary | Select important parts to include |
| Question Answering | Question + context | Generated answer | Find relevant context for the answer |
| Speech-to-text | Audio features | Transcript text | Align audio frames to text tokens |
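All of these tasks share the same inference loop: run the encoder once over the input, then generate the output token by token, with each decoder step cross-attending to the cached encoder output. A toy greedy-decoding sketch; `encode` and `decode_step` are hypothetical stand-ins that a real system would replace with Transformer encoder and decoder stacks:

```python
import numpy as np

VOCAB, BOS, EOS = 10, 0, 1  # toy vocabulary with begin/end markers

def encode(src_tokens):
    # Stand-in for the encoder: real models return (src_len, d) vectors.
    return np.asarray(src_tokens, dtype=float)

def decode_step(enc_out, prefix):
    # Stand-in for one decoder pass: scores the next token given the
    # encoder output (via cross-attention) and the prefix so far.
    rng = np.random.default_rng(len(prefix))  # deterministic toy logits
    return rng.standard_normal(VOCAB)

def greedy_decode(src_tokens, max_len=10):
    enc_out = encode(src_tokens)      # encoder runs exactly once
    prefix = [BOS]
    for _ in range(max_len):
        logits = decode_step(enc_out, prefix)
        nxt = int(np.argmax(logits))  # greedy: take the top-scoring token
        prefix.append(nxt)
        if nxt == EOS:                # stop once the end marker is emitted
            break
    return prefix[1:]

result = greedy_decode([2, 3, 4])
```

Because the source is fixed, the encoder output can be computed once and reused at every decoding step; only the decoder (and its cross-attention queries) runs per token.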