Transformer Architecture
The architecture behind GPT, BERT, and modern AI
Attention Basics
Build up the attention mechanism step by step
Self-Attention: The Core Idea
Understand why every token needs to see every other token — and how attention makes this possible.
Query, Key, Value Projections
Learn the three projections that drive attention — like searching a library with questions, titles, and content.
Softmax & Attention Output
See how dot products become probabilities and how tokens blend information from the most relevant neighbors.
Extending Attention
Add multi-head diversity and position awareness
Multi-Head Attention
See how multiple attention heads capture different types of relationships simultaneously.
Positional Encoding
Discover how sinusoidal waves inject order information into the Transformer's parallel world.
The Full Block
Complete the Transformer block architecture
Feed-Forward Networks
Learn why attention needs a non-linear partner, and how an expand-then-compress network transforms each token independently.
Residual Connections & Layer Norm
See how skip connections and normalization make it possible to stack 96 or more Transformer blocks.
Architecture Variants
How attention powers BERT, GPT, and translation
The Encoder: BERT
Explore bidirectional attention and masked language modeling — the architecture behind BERT.
The Decoder: GPT
See how causal masking enables autoregressive text generation — the architecture behind ChatGPT.
Cross-Attention & Encoder-Decoder
Learn how the encoder feeds into the decoder through cross-attention, the mechanism behind translation models and T5.
Continue Your Journey
After mastering Transformers, see what they power