Transformer Architecture

The architecture behind GPT, BERT, and modern AI

Beginner to Intermediate

10 of 10 lessons available

3-4 hours total

Attention Basics

Build up the attention mechanism step by step

Start Here

Beginner · 10-12 min

Self-Attention: The Core Idea

Understand why every token needs to see every other token — and how attention makes this possible.

Attention · Parallel Processing · Long-Range Dependencies

Intermediate · 12-15 min

Query, Key, Value Projections

Learn the three projections that drive attention — like searching a library with questions, titles, and content.

Query · Key · Value · Linear Projections

Intermediate · 12-15 min

Softmax & Attention Output

See how dot products become probabilities and how tokens blend information from the most relevant neighbors.

Dot Product Scoring · Softmax · Weighted Sum · Attention Heatmaps

Extending Attention

Add multi-head diversity and position awareness

Intermediate · 10-12 min

Multi-Head Attention

See how multiple attention heads capture different types of relationships simultaneously.

Multiple Heads · Head Specialization · Concatenation

Intermediate · 10-12 min

Positional Encoding

Discover how sinusoidal waves inject order information into the Transformer's parallel world.

Position Information · Sinusoidal Encoding · Learned Positions

The Full Block

Complete the Transformer block architecture

Intermediate · 8-10 min

Feed-Forward Networks

Learn why attention needs a non-linear partner — and how expansion/compression transforms each token.

Non-linearity · Expansion · Per-Token Processing

Intermediate · 10-12 min

Residual Connections & Layer Norm

See how skip connections and normalization make it possible to stack 96 or more Transformer blocks.

Skip Connections · Layer Normalization · Deep Networks

Architecture Variants

How attention powers BERT, GPT, and translation

Intermediate · 12-15 min

The Encoder: BERT

Explore bidirectional attention and masked language modeling — the architecture behind BERT.

Bidirectional Attention · Masked LM · [CLS] Token · Fine-tuning

Intermediate · 12-15 min

The Decoder: GPT

See how causal masking enables autoregressive text generation — the architecture behind ChatGPT.

Causal Masking · Autoregressive · Temperature · RLHF

Intermediate · 12-15 min

Cross-Attention & Encoder-Decoder

Learn how the encoder feeds into the decoder through cross-attention — powering translation and T5.

Cross-Attention · Encoder-Decoder · Translation · T5

Continue Your Journey

After mastering Transformers, see what they power

Explore Large Language Models