The original Transformer (2017) used both an encoder and a decoder working together. This is the architecture designed for sequence-to-sequence tasks: take an input sequence and produce a different output sequence.

Encoder: Reads the input with bidirectional attention. Produces rich representations of the source text. ("Understand the input.")

Decoder: Generates the output autoregressively. Uses causal self-attention within the output sequence. ("Produce the output.")

Cross-attention: The bridge between encoder and decoder. The decoder attends to the encoder's output to access source information. ("What in the input is relevant to what I'm generating now?")
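The cross-attention step can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not a reference implementation: the projection matrices `W_q`, `W_k`, `W_v`, the sizes, and the random inputs are all assumptions made for the example. The point to notice is that queries come from the decoder while keys and values come from the encoder output.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_output, W_q, W_k, W_v):
    Q = decoder_states @ W_q   # queries from the target side, shape (T, d)
    K = encoder_output @ W_k   # keys from the source side, shape (S, d)
    V = encoder_output @ W_v   # values from the source side, shape (S, d)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (T, S): one row per target position
    weights = softmax(scores, axis=-1)        # no mask: every source token is visible
    return weights @ V, weights

# Hypothetical sizes and random data, for illustration only.
rng = np.random.default_rng(0)
S, T, d_model = 5, 3, 8
encoder_output = rng.normal(size=(S, d_model))
decoder_states = rng.normal(size=(T, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

context, weights = cross_attention(decoder_states, encoder_output, W_q, W_k, W_v)
print(context.shape, weights.shape)   # (3, 8) (3, 5)
```

Each row of `weights` sums to 1 and can be read as how strongly one target position draws on each source token.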

Each decoder block has three sublayers:

• Causal self-attention: Attend to previously generated output tokens
• Cross-attention: Attend to the encoder's output (the source sequence)
• Feed-forward network: Transform each token independently
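The three sublayers can be sketched as one function. This is a simplified single-head version under stated assumptions: the parameter names in `params` are hypothetical, the weights are random, and the residual connections and layer normalization that the original Transformer wraps around each sublayer are left out to keep the data flow visible.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys, values, mask=None):
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    if mask is not None:
        # Masked positions get a very negative score, so their weight is ~0.
        scores = np.where(mask, scores, -1e9)
    return softmax(scores) @ values

def decoder_block(target, encoder_output, params):
    T = target.shape[0]
    causal_mask = np.tril(np.ones((T, T), dtype=bool))  # position i sees positions 0..i

    # 1. Causal self-attention: queries, keys, and values all come from the target.
    x = attention(target @ params["self_q"], target @ params["self_k"],
                  target @ params["self_v"], mask=causal_mask)

    # 2. Cross-attention: queries from the decoder, keys/values from the encoder.
    x = attention(x @ params["cross_q"], encoder_output @ params["cross_k"],
                  encoder_output @ params["cross_v"])

    # 3. Feed-forward network: applied to each position independently.
    hidden = np.maximum(0.0, x @ params["ffn_in"])  # ReLU
    return hidden @ params["ffn_out"]

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
S, T, d = 5, 4, 8
names = ["self_q", "self_k", "self_v", "cross_q", "cross_k", "cross_v"]
params = {name: 0.1 * rng.normal(size=(d, d)) for name in names}
params["ffn_in"] = 0.1 * rng.normal(size=(d, 4 * d))
params["ffn_out"] = 0.1 * rng.normal(size=(4 * d, d))

target = rng.normal(size=(T, d))
encoder_output = rng.normal(size=(S, d))
out = decoder_block(target, encoder_output, params)
print(out.shape)   # (4, 8)
```

Because of the causal mask, changing the last target token leaves the outputs at earlier positions unchanged, which is what makes token-by-token generation possible.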

This architecture is well suited to tasks where the input and output are different sequences: translation ("The cat sat" -> "Le chat s'est assis"), summarization, or generative question answering.
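At inference time, a sequence-to-sequence model typically encodes the source once and then calls the decoder repeatedly, feeding each generated token back in. The loop below is a toy sketch of that pattern using greedy decoding. Everything in it is an assumption for illustration: `encode` and `decode_step` are stand-ins with random weights and no learned projections, and the `BOS`/`EOS` ids and vocabulary size are made up, so the generated ids carry no meaning.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 10, 8
BOS, EOS = 0, 1                       # hypothetical start and end token ids
embed = rng.normal(size=(vocab_size, d_model))
W_out = rng.normal(size=(d_model, vocab_size))

def attend(query, keys_values):
    # Scaled dot-product attention for a single query vector.
    scores = keys_values @ query / np.sqrt(query.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ keys_values

def encode(source_ids):
    # Stand-in encoder: embeddings only. A real encoder would apply
    # stacked bidirectional self-attention layers here.
    return embed[source_ids]

def decode_step(target_ids, encoder_output):
    targets = embed[target_ids]
    state = attend(targets[-1], targets)              # self-attention over tokens so far
    state = state + attend(state, encoder_output)     # cross-attention to the source
    return state @ W_out                              # logits over the vocabulary

def greedy_decode(source_ids, max_len=6):
    encoder_output = encode(source_ids)   # the source is encoded once
    target_ids = [BOS]
    for _ in range(max_len):
        logits = decode_step(target_ids, encoder_output)
        next_id = int(np.argmax(logits))  # greedy: pick the highest-scoring token
        target_ids.append(next_id)
        if next_id == EOS:
            break
    return target_ids

generated = greedy_decode([2, 3, 4])
print(generated)
```

The output length is decided by the decoder (it stops at `EOS` or `max_len`), not by the length of the input, which is why the same structure covers translation, summarization, and transcription.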

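Components of the Encoder-Decoder Transformer

| Component              | Attention Type  | Query From    | Key/Value From                 | Purpose                           |
|------------------------|-----------------|---------------|--------------------------------|-----------------------------------|
| Encoder self-attention | Bidirectional   | Source tokens | Source tokens                  | Understand the input              |
| Decoder self-attention | Causal (masked) | Target tokens | Target tokens                  | Model output dependencies         |
| Cross-attention        | Full (no mask)  | Target tokens | Source tokens (encoder output) | Connect output to input           |
| Feed-forward (encoder) | N/A             | N/A           | N/A                            | Transform encoder representations |
| Feed-forward (decoder) | N/A             | N/A           | N/A                            | Transform decoder representations |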
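The three attention types differ mainly in the shape and contents of their attention masks. The sketch below builds each mask for a hypothetical source length `S = 4` and target length `T = 3`; rows index query positions and columns index key/value positions. Padding masks, which real implementations usually add, are omitted here.

```python
import numpy as np

S, T = 4, 3   # hypothetical source and target lengths

# Encoder self-attention: every source token may attend to every source token.
encoder_self_mask = np.ones((S, S), dtype=bool)

# Decoder self-attention: target position i may attend only to positions 0..i.
decoder_self_mask = np.tril(np.ones((T, T), dtype=bool))

# Cross-attention: every target position may attend to every source token.
cross_mask = np.ones((T, S), dtype=bool)

print(decoder_self_mask.astype(int))
# [[1 0 0]
#  [1 1 0]
#  [1 1 1]]
```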

Sequence-to-Sequence Tasks

| Task               | Input (Encoder)    | Output (Decoder)      | Cross-Attention Role                 |
|--------------------|--------------------|-----------------------|--------------------------------------|
| Translation        | "The cat sat"      | "Le chat s'est assis" | Align source words to target words   |
| Summarization      | Long article       | Short summary         | Select important parts to include    |
| Question answering | Question + context | Generated answer      | Find relevant context for the answer |
| Speech-to-text     | Audio features     | Transcript text       | Align audio frames to text tokens    |
