Transformer Architecture

The architecture behind GPT, BERT, and modern AI

Beginner to Intermediate
10 of 10 lessons available
3-4 hours total
1. Attention Basics

Build up the attention mechanism step by step

Start Here
Beginner, 10-12 min

Self-Attention: The Core Idea

Understand why every token needs to see every other token — and how attention makes this possible.

Attention, Parallel Processing, Long-Range Dependencies
Intermediate, 12-15 min

Query, Key, Value Projections

Learn the three projections that drive attention — like searching a library with questions, titles, and content.

Query, Key, Value, Linear Projections
Intermediate, 12-15 min

Softmax & Attention Output

See how dot products become probabilities and how tokens blend information from the most relevant neighbors.

Dot Product Scoring, Softmax, Weighted Sum, Attention Heatmaps
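
As a preview of this lesson, here is a minimal sketch of how one token's query could be scored against every key, turned into probabilities with softmax, and used to blend the value vectors. The function names (`softmax`, `attend`) and the toy vectors are illustrative assumptions, not code from the lesson or from a trained model.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    # Dot product of the query with each key, scaled by sqrt(dimension).
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # Weighted sum of the value vectors, one output dimension at a time.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Toy, hand-picked vectors (hypothetical, for illustration only).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
weights, output = attend([1.0, 0.0], keys, values)
```

The keys that point in the same direction as the query receive the largest weights, so the output leans toward their values.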
2. Extending Attention

Add multi-head diversity and position awareness

Intermediate, 10-12 min

Multi-Head Attention

See how multiple attention heads capture different types of relationships simultaneously.

Multiple Heads, Head Specialization, Concatenation
Intermediate, 10-12 min

Positional Encoding

Discover how sinusoidal waves inject order information into the Transformer's parallel world.

Position Information, Sinusoidal Encoding, Learned Positions
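
To make the idea concrete, the sketch below computes a sinusoidal encoding for a single position, pairing a sine and a cosine at each frequency. The function name `positional_encoding`, the `base` of 10000, and the model width of 8 are assumptions chosen for illustration; the lesson itself may present it differently.

```python
import math

def positional_encoding(position, d_model, base=10000.0):
    """Sinusoidal encoding for one position; d_model is assumed to be even."""
    encoding = []
    for i in range(0, d_model, 2):
        # Each sine/cosine pair uses a lower frequency than the previous one.
        angle = position / (base ** (i / d_model))
        encoding.append(math.sin(angle))  # even index
        encoding.append(math.cos(angle))  # odd index
    return encoding

pe0 = positional_encoding(0, 8)  # hypothetical model width of 8
pe1 = positional_encoding(1, 8)
```

Each position gets a distinct vector, which is typically added to the token embedding so that otherwise order-blind attention can tell positions apart.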
3. The Full Block

Complete the Transformer block architecture

Intermediate, 8-10 min

Feed-Forward Networks

Learn why attention needs a non-linear partner — and how expansion/compression transforms each token.

Non-linearity, Expansion, Per-Token Processing
Intermediate, 10-12 min

Residual Connections & Layer Norm

See how skip connections and normalization make it possible to stack 96 or more Transformer blocks.

Skip Connections, Layer Normalization, Deep Networks
4. Architecture Variants

How attention powers BERT, GPT, and translation

Intermediate, 12-15 min

The Encoder: BERT

Explore bidirectional attention and masked language modeling — the architecture behind BERT.

Bidirectional Attention, Masked LM, [CLS] Token, Fine-tuning
Intermediate, 12-15 min

The Decoder: GPT

See how causal masking enables autoregressive text generation — the architecture behind ChatGPT.

Causal Masking, Autoregressive, Temperature, RLHF
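
A rough sketch of causal masking, assuming attention scores are stored as a square list of lists: every score for a later position is set to negative infinity before softmax, so each token can only attend to itself and earlier tokens. The helper names (`causal_mask`, `softmax`) and the all-zero toy scores are hypothetical.

```python
import math

NEG_INF = float("-inf")

def causal_mask(scores):
    """Set scores[i][j] to -inf for j > i so token i cannot see later tokens."""
    n = len(scores)
    return [[scores[i][j] if j <= i else NEG_INF for j in range(n)]
            for i in range(n)]

def softmax(row):
    """Row-wise softmax; masked entries get weight 0 because exp(-inf) is 0."""
    shift = max(row)  # finite, since the diagonal entry is never masked
    exps = [math.exp(s - shift) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

# Toy scores: all zeros, so every visible token gets an equal share.
scores = [[0.0] * 4 for _ in range(4)]
weights = [softmax(row) for row in causal_mask(scores)]
```

With equal scores, the first token attends only to itself, while the last token spreads its attention evenly over all four positions.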
Intermediate, 12-15 min

Cross-Attention & Encoder-Decoder

Learn how the encoder feeds into the decoder through cross-attention — powering translation and T5.

Cross-Attention, Encoder-Decoder, Translation, T5

Continue Your Journey

After mastering Transformers, see what they power