Encoder vs Decoder Architecture

| Property | Encoder (BERT) | Decoder (GPT) |
|---|---|---|
| Attention direction | Bidirectional (left + right) | Unidirectional (left only) |
| Each token sees | All tokens in the sequence | Only previous tokens |
| Best for | Understanding text (classification, NER, QA) | Generating text (completion, chat) |
| Masking | No attention mask (full visibility) | Causal mask (upper triangle blocked) |
| Pre-training task | Masked Language Modeling (fill in blanks) | Next token prediction |

Encoder & BERT — Lesson Content

Learn how BERT, the model that revolutionized NLP, uses bidirectional attention and masked language modeling to understand text.

The encoder is the "understanding" half of the Transformer, and BERT showed how to pre-train it for powerful text representations. Unlike decoders that read left-to-right, encoders use bidirectional attention: every token sees every other token simultaneously. You'll learn BERT's Masked Language Modeling pre-training, the [CLS] token for classification, the pre-train/fine-tune paradigm, real-world applications (sentiment, NER, QA), and how variants like RoBERTa, ALBERT, and DistilBERT improved on the original.

Learning Objectives

Explain why bidirectional attention gives encoders richer representations
Describe the Masked Language Modeling pre-training objective
Explain the role of the [CLS] token in sequence classification
Distinguish pre-training from fine-tuning and explain why the paradigm works
Compare BERT variants (RoBERTa, ALBERT, DistilBERT) and their trade-offs

Step 1: The Encoder: Bidirectional Attention

The original Transformer (2017) had both an **encoder** and a **decoder**. The **encoder** is the half that reads and understands input text.

The key property of the encoder is **bidirectional attention**: every token can attend to every other token, both left and right. When processing "The cat sat down":

- "cat" attends to both "The" (left) and "sat" (right)
- "sat" attends to both "cat" (left) and "down" (right)

This is different from a decoder, which can only attend to tokens to the left (more on that in the next lesson).

**BERT** (Bidirectional Encoder Representations from Transformers, 2018) took the encoder half and showed that bidirectional pre-training produces powerful text representations that can be fine-tuned for dozens of NLP tasks.

The attention heatmap below shows our full bidirectional attention pattern -- notice that every token can attend to every other token. There is no masking; all connections are allowed.
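The difference between the two attention patterns can be sketched as a visibility matrix. This is a minimal illustration in plain Python, not code from BERT itself; the helper names `attention_mask` and `visible_tokens` are made up for this example.

```python
def attention_mask(n_tokens, causal=False):
    """Build an n x n matrix of 0/1 flags.

    Entry [query][key] is 1 when the token at position `query`
    is allowed to attend to the token at position `key`.
    """
    return [
        [1 if (not causal or key <= query) else 0 for key in range(n_tokens)]
        for query in range(n_tokens)
    ]


def visible_tokens(mask, tokens, word):
    """List the tokens that `word` may attend to under `mask`."""
    row = mask[tokens.index(word)]
    return [tok for tok, allowed in zip(tokens, row) if allowed]


tokens = ["The", "cat", "sat", "down"]
encoder_mask = attention_mask(len(tokens))               # bidirectional: all ones
decoder_mask = attention_mask(len(tokens), causal=True)  # lower triangle only

print(visible_tokens(encoder_mask, tokens, "cat"))  # ['The', 'cat', 'sat', 'down']
print(visible_tokens(decoder_mask, tokens, "cat"))  # ['The', 'cat']
```

In both cases a token also attends to itself. The only difference is whether positions to the right are visible.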

Step 2: Masked Language Modeling: Predict [MASK]

BERT's pre-training task is **Masked Language Modeling (MLM)**: randomly select 15% of the input tokens and train the model to predict the original token at each selected position.

**Example:**

Input: "The [MASK] sat down"
Target: "The **cat** sat down"

The model must use **bidirectional context** to predict the masked word:

- "The" (left context) suggests a noun follows
- "sat down" (right context) suggests an animate subject

This is fundamentally different from GPT's next-token prediction, where the model can only use left context. MLM forces the model to build rich **bidirectional** representations.

BERT does not always replace a selected token with [MASK]. For each selected token:

- 80% of the time: replace it with the [MASK] token
- 10% of the time: replace it with a random token
- 10% of the time: keep the original token

The reason is that [MASK] never appears during fine-tuning, so always masking would create a mismatch between pre-training and fine-tuning. Because some selected tokens are replaced at random or left unchanged, the model cannot tell which tokens have been corrupted and must keep a good contextual representation at every position, not only at [MASK] positions. Keeping the original token 10% of the time also biases the representation toward the word that was actually observed.

BERT was also trained with **Next Sentence Prediction (NSP)**: given two sentences, predict whether the second follows the first. (Later work, such as RoBERTa, found NSP to be less important than MLM.)
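The 80/10/10 corruption rule can be sketched in a few lines. This is a simplified illustration rather than BERT's actual preprocessing code: it works on whole words instead of WordPiece tokens, and the function name `mask_tokens`, the tiny vocabulary, and the `select_prob` parameter are assumptions made for the example.

```python
import random

MASK = "[MASK]"
SPECIAL = {"[CLS]", "[SEP]"}  # special tokens are never selected for prediction


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted, labels) for one sequence.

    labels[i] holds the original token at selected positions and None
    elsewhere; the MLM loss would only be computed where a label is set.
    """
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in SPECIAL or rng.random() >= select_prob:
            continue  # position not selected: leave it alone, no label
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return corrupted, labels


rng = random.Random(0)  # seeded so the run is repeatable
vocab = ["the", "cat", "dog", "sat", "down", "mat"]
tokens = ["[CLS]", "the", "cat", "sat", "down", "[SEP]"]

# select_prob is raised above 15% here only so a short sentence gets some masks
corrupted, labels = mask_tokens(tokens, vocab, rng, select_prob=0.5)
print(corrupted)
print(labels)
```

Note that the model is asked to predict the original token at every selected position, including the positions where the input was left unchanged or replaced by a random token.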

Step 3: The [CLS] Token for Classification

BERT adds a special **[CLS]** (classification) token at the beginning of every input sequence:

**Input:** [CLS] The cat sat down [SEP]

After passing through all encoder layers (12 in BERT-base), the [CLS] position has attended to every other token in the sequence. Its final hidden state serves as a **summary of the entire input**: a single vector that represents the whole sequence.

For classification tasks, BERT adds a simple linear layer on top of the [CLS] representation:

**prediction = softmax(W * h_[CLS] + b)**

This is elegant: during pre-training, the [CLS] representation is the input to Next Sentence Prediction, which pushes it toward a useful sequence-level representation. During fine-tuning, we just add a task-specific head.

**[SEP]** (separator) tokens mark sentence boundaries. For sentence-pair tasks (e.g., natural language inference), the input looks like:

[CLS] sentence_A [SEP] sentence_B [SEP]

The model can attend across both sentences, learning relationships between them.
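The input layout and the classification formula above can be sketched as follows. This is a toy illustration under stated assumptions: `build_input` splits on whitespace instead of using BERT's WordPiece tokenizer, and the 3-dimensional `h_cls` vector and the weights `W` and `b` are made-up numbers standing in for a real 768-dimensional hidden state and a trained head.

```python
import math


def build_input(sentence_a, sentence_b=None):
    """Wrap one or two sentences in [CLS] / [SEP] markers."""
    tokens = ["[CLS]"] + sentence_a.split() + ["[SEP]"]
    if sentence_b is not None:
        tokens += sentence_b.split() + ["[SEP]"]
    return tokens


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(h_cls, W, b):
    """Compute softmax(W * h_cls + b) for a single [CLS] vector."""
    logits = [
        sum(w * h for w, h in zip(row, h_cls)) + bias
        for row, bias in zip(W, b)
    ]
    return softmax(logits)


print(build_input("The cat sat down"))
# ['[CLS]', 'The', 'cat', 'sat', 'down', '[SEP]']

h_cls = [0.5, -1.0, 2.0]                  # hypothetical final hidden state of [CLS]
W = [[1.0, 0.0, 0.5], [-1.0, 0.5, 0.0]]  # one row of weights per class
b = [0.0, 0.1]
probs = classify(h_cls, W, b)
print(probs)  # two class probabilities that sum to 1
```

Only `W` and `b` are new parameters introduced for the task; everything that produces `h_cls` comes from the pre-trained encoder.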

Step 4: Pre-training vs Fine-tuning

BERT popularized the **pre-train then fine-tune** paradigm that revolutionized NLP:

**Phase 1: Pre-training (expensive, done once)**

- Train on massive unlabeled text (Wikipedia + BookCorpus, ~3.3B words)
- Task: Masked Language Modeling + Next Sentence Prediction
- Hardware: 64 TPU chips for 4 days (for BERT-large)
- Result: General-purpose language understanding

**Phase 2: Fine-tuning (cheap, done per task)**

- Start from the pre-trained BERT weights
- Add a small task-specific head (e.g., linear classifier)
- Train on your labeled data (often just thousands of examples)
- Hardware: Single GPU for a few hours
- Result: Strong performance on your specific task, often state-of-the-art at the time of release

This was revolutionary because:

1. **Transfer learning works:** Knowledge from pre-training transfers to downstream tasks
2. **Small data is enough:** Fine-tuning needs far fewer labeled examples than training from scratch
3. **One model, many tasks:** The same pre-trained BERT can be fine-tuned for sentiment analysis, NER, QA, etc.

Before this paradigm took hold, most NLP tasks relied on task-specific architectures trained largely from scratch. BERT showed that a single pre-trained model could be adapted to a wide range of tasks.
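To make "a small task-specific head" concrete, the arithmetic below compares the size of a linear classification head with the size of the pre-trained model. It is a rough sketch: the hidden size of 768 and the figure of roughly 110 million parameters are the commonly cited values for BERT-base, and the function name `linear_head_params` is made up for this example.

```python
def linear_head_params(hidden_size, num_labels):
    """Parameters in a linear layer: one weight per (hidden unit, label) plus one bias per label."""
    return hidden_size * num_labels + num_labels


BERT_BASE_PARAMS = 110_000_000  # approximate size of BERT-base
HIDDEN_SIZE = 768               # hidden size of BERT-base

head = linear_head_params(HIDDEN_SIZE, num_labels=2)
print(head)  # 1538
print(f"head is {head / BERT_BASE_PARAMS:.4%} of the pre-trained model")
```

Almost all of the parameters are reused from pre-training, which is one way to see why fine-tuning can work with far less labeled data than training from scratch.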

Step 5: BERT Applications

BERT set a new state of the art on 11 NLP tasks upon release. Here are its main applications:

**Sentiment Analysis:** Is this review positive or negative?

- Input: [CLS] "This movie was absolutely terrible" [SEP]
- Use the [CLS] token to classify sentiment

**Named Entity Recognition (NER):** What are the entities in this text?

- Input: "Barack Obama visited Paris in 2015"
- Use each token's representation to tag: Person, Location, Date, etc.

**Question Answering:** Given a question and passage, find the answer.

- Input: [CLS] "Where does the cat sit?" [SEP] "The cat sat down on the mat" [SEP]
- Predict the start and end positions of the answer span ("on the mat")

**Semantic Similarity:** Do these two sentences mean the same thing?

- Input: [CLS] "The cat sat down" [SEP] "A feline rested" [SEP]
- Use [CLS] to predict similarity score

**Text Classification:** Categorize text into topics, detect spam, classify intent, etc.

BERT's contextual embeddings are far superior to static embeddings (Word2Vec) because the same word gets different representations based on context. "Bank" in "river bank" and "bank" in "bank account" produce different vectors.

Step 6: BERT Variants: RoBERTa, ALBERT, DistilBERT

BERT spawned a family of improved variants, each addressing different limitations:

**RoBERTa** (Robustly Optimized BERT, Facebook 2019):

- Trains longer on more data (160GB of text vs BERT's 16GB)
- Removes Next Sentence Prediction (found to be unhelpful)
- Uses dynamic masking (a new mask is sampled each time a sequence is seen, instead of one fixed mask created during preprocessing)
- Result: Outperforms BERT on the benchmarks reported in its paper (GLUE, SQuAD, RACE)

**ALBERT** (A Lite BERT, Google 2019):

- Factorizes embedding parameters (128-dim embeddings projected to the 768-dim hidden size)
- Shares parameters across all layers (12 layers share the same weights!)
- Replaces NSP with Sentence Order Prediction
- Result: Far fewer parameters (about 18x fewer for the configuration comparable to BERT-large) with competitive performance

**DistilBERT** (Hugging Face, 2019):

- Knowledge distillation: train a smaller model to mimic BERT
- 6 layers instead of 12, 40% smaller, 60% faster
- Retains about 97% of BERT's language-understanding performance
- Well suited to deployment where speed and size matter

These variants demonstrate a key lesson: the architecture often matters less than the training recipe, data quality, and model size.
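ALBERT's embedding factorization is easy to check with arithmetic. The sketch below counts embedding parameters with and without the 128-dimensional bottleneck. The vocabulary size of 30,000 is an assumed round figure for illustration, and the function name `embedding_params` is made up for this example.

```python
def embedding_params(vocab_size, hidden_size, embed_size=None):
    """Count token-embedding parameters.

    Without embed_size: one vocab_size x hidden_size table (BERT style).
    With embed_size: a vocab_size x embed_size table followed by an
    embed_size x hidden_size projection (ALBERT style).
    """
    if embed_size is None:
        return vocab_size * hidden_size
    return vocab_size * embed_size + embed_size * hidden_size


VOCAB_SIZE = 30_000  # assumed round figure
HIDDEN_SIZE = 768

full = embedding_params(VOCAB_SIZE, HIDDEN_SIZE)
factorized = embedding_params(VOCAB_SIZE, HIDDEN_SIZE, embed_size=128)

print(full)        # 23040000
print(factorized)  # 3938304
print(f"reduction: {full / factorized:.1f}x")
```

This covers only the embedding table. Sharing one set of layer weights across all 12 layers is a separate saving on top of it.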

Step 7: Test Your Understanding

You've learned about the encoder architecture, BERT's pre-training approach, and how it revolutionized NLP. Let's test your understanding.

Prerequisites

Self-attention mechanism
Multi-head attention
Transformer block (residual + layer norm + FFN)

Key Concepts

Bidirectional Attention
Masked Language Modeling (MLM)
[CLS] Token
Pre-training & Fine-tuning
Transfer Learning
BERT Variants
