Fine-tuning

Teaching New Tricks

Difficulty: Intermediate
Duration: 12-15 min
Prerequisites: Pre-training
Step 1: Why Pre-training Isn't Enough

A pre-trained LLM like GPT or LLaMA has learned language from trillions of tokens scraped from the internet. It can predict the next token brilliantly. But it has a fundamental problem: it doesn't know what you want it to do.

Ask a pre-trained model "What is the capital of France?" and it might continue with:

  • "What is the capital of Germany? What is the capital of..." (it sees a pattern of questions)
  • "...a question often asked by geography students" (it describes the question)
  • "Paris" (what you actually wanted — but only sometimes)

The model is a text completion engine, not an assistant. It hasn't learned to:

  • Follow instructions ("Summarize this article in 3 bullet points")
  • Stay on topic (instead of generating tangentially related text)
  • Refuse harmful requests ("How to build a bomb" → should decline)
  • Match a specific tone, format, or domain

Fine-tuning bridges this gap: you take the pre-trained model and continue training it on a curated dataset that teaches the specific behavior you want. The model keeps its broad language knowledge but learns new skills on top.

Pre-trained vs Fine-tuned Model

| Capability | Pre-trained Model | Fine-tuned Model |
| --- | --- | --- |
| Language understanding | Excellent | Excellent (retained) |
| Following instructions | Poor (just completes text) | Good (trained on instruction-response pairs) |
| Specific domain knowledge | General only | Deep (if fine-tuned on domain data) |
| Output format control | Unpredictable | Consistent (JSON, markdown, etc.) |
| Safety / alignment | None (will generate anything) | Trained to refuse harmful requests |
| Tone / personality | Mimics training data mix | Matches desired persona |

Pre-training vs Fine-tuning Analogies

| Analogy | Pre-training | Fine-tuning |
| --- | --- | --- |
| Education | General education (K-12) | Medical school specialization |
| Cooking | Learning all cuisines | Mastering French pastry |
| Music | Learning to play instruments | Rehearsing for a specific concert |
| Language | Learning vocabulary and grammar | Learning legal or medical writing |

Fine-Tuning: From Base Model to Assistant — Lesson Content

Learn how supervised fine-tuning, instruction tuning, and parameter-efficient methods like LoRA transform a pre-trained LLM into a useful assistant.

Pre-trained LLMs are text completion engines — they don't inherently know how to follow instructions or be helpful. Fine-tuning bridges this gap by continuing training on curated data that teaches specific behaviors. You'll learn the full spectrum from supervised fine-tuning (SFT) to instruction tuning to parameter-efficient methods like LoRA and QLoRA, with practical guidance on when to fine-tune versus when to just write a better prompt.

Learning Objectives

  • Explain why pre-trained LLMs need fine-tuning to be useful
  • Describe the supervised fine-tuning process and key hyperparameters
  • Understand how LoRA achieves 99%+ parameter reduction
  • Compare full fine-tuning with parameter-efficient methods
  • Decide when to fine-tune versus when to use prompting


Step 2: Supervised Fine-Tuning (SFT)

**Supervised Fine-Tuning (SFT)** is the most straightforward fine-tuning approach. You prepare a dataset of (input, desired_output) pairs and train the model to produce the desired output given the input.

**How it works:**

  1. Collect a dataset of high-quality examples: prompts paired with ideal responses
  2. Format them as training examples (e.g., "Prompt: ... Response: ...")
  3. Continue training the pre-trained model on this dataset with a **small learning rate**
  4. The model adjusts its weights to produce outputs that match your examples

**The learning rate is critical.** Too high and you overwrite the pre-trained knowledge (catastrophic forgetting). Too low and the model barely changes. Typical SFT learning rates are roughly 10-100x smaller than pre-training rates (e.g., 1e-5 vs 1e-4).

**Dataset size matters less than quality.** A few thousand high-quality examples often outperform millions of mediocre ones. The Stanford Alpaca project showed that a comparatively small set of 52,000 instruction-following examples could turn LLaMA into a capable instruction follower.

**Loss behavior:** On a loss curve, SFT typically starts at a much lower loss than pre-training (because the model already understands language) and converges faster (because it is learning a narrow task, not all of language).
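The formatting step of SFT can be sketched in a few lines. This is a minimal illustration, not a definitive recipe: the whitespace tokenizer and the `build_sft_example` helper are hypothetical, and the choice to compute loss only on response tokens is an assumption (a common convention, marked here with the label value -100 that PyTorch's cross-entropy loss ignores by default), not something this lesson prescribes.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def tokenize(text, vocab):
    """Toy whitespace tokenizer: each new word gets the next free integer id."""
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]


def build_sft_example(prompt, response, vocab):
    """Turn one (prompt, response) pair into input ids and training labels."""
    prompt_ids = tokenize(f"Prompt: {prompt} Response:", vocab)
    response_ids = tokenize(response, vocab)
    input_ids = prompt_ids + response_ids
    # Assumption: only response positions carry a label, so the loss rewards
    # producing the desired output rather than re-predicting the prompt.
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    return input_ids, labels


vocab = {}
input_ids, labels = build_sft_example(
    "What is the capital of France?", "Paris", vocab
)
print(input_ids)
print(labels)
```

A real pipeline would use the model's own tokenizer and chat template instead of this toy one; the point is only that each training example is the prompt and the ideal response joined into one sequence.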

Step 3: Instruction Tuning

**Instruction tuning** is a specialized form of SFT where the training data consists of explicit instructions paired with correct responses. This teaches the model to be a general-purpose instruction follower rather than a specialist.

**The key innovation:** Instead of training on a single task, you train on a **diverse mix of tasks**, all formatted as instructions:

  • "Translate the following English text to French: ..."
  • "Summarize this article in 3 bullet points: ..."
  • "Write a Python function that sorts a list: ..."
  • "Is the sentiment of this review positive or negative? ..."

**Why this works:** The model learns a **meta-skill**: "when given an instruction, follow it." This generalizes to new instructions it has never seen. A model instruction-tuned on 1,000 task types can often handle task type 1,001.

**Key datasets:**

  • **FLAN** (Google): 1,836 tasks, 15M examples. Showed that instruction tuning dramatically improves zero-shot performance
  • **Alpaca** (Stanford): 52K examples generated with OpenAI's text-davinci-003. Showed that a small, inexpensive synthetic dataset can suffice
  • **OpenAssistant**: 160K messages from human volunteers. Community-driven
  • **Dolly** (Databricks): 15K examples written by employees. Licensed for commercial use

The quality and diversity of your instruction dataset are often the biggest factors in the fine-tuned model's capabilities.
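As a rough sketch of what "a diverse mix of tasks, all formatted as instructions" can look like in practice, the snippet below pushes several task types through one shared template and shuffles them so no single task dominates a stretch of training. The template, the `build_instruction_mix` helper, and the four records are all hypothetical illustrations, not taken from FLAN, Alpaca, or any other real dataset.

```python
import random

# Hypothetical prompt template; real datasets each define their own.
TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n{response}"

# A tiny hand-written mix of task types (illustrative data only).
TASKS = [
    {"type": "translation",
     "instruction": "Translate to French: Good morning",
     "response": "Bonjour"},
    {"type": "summarization",
     "instruction": "Summarize in one sentence: The meeting moved from "
                    "Monday to Tuesday because of a holiday.",
     "response": "The meeting was moved to Tuesday."},
    {"type": "coding",
     "instruction": "Write a Python expression that sorts the list xs.",
     "response": "sorted(xs)"},
    {"type": "sentiment",
     "instruction": "Is this review positive or negative? 'I loved it.'",
     "response": "positive"},
]


def build_instruction_mix(tasks, seed=0):
    """Format every task with one shared template, then shuffle the result."""
    examples = [
        {"type": task["type"], "text": TEMPLATE.format(**task)}
        for task in tasks
    ]
    # A seeded shuffle interleaves task types reproducibly.
    random.Random(seed).shuffle(examples)
    return examples


mix = build_instruction_mix(TASKS)
for example in mix:
    print(example["type"], "->", repr(example["text"][:45]))
```

Because every task shares the same surface format, the only thing that changes between examples is the instruction itself, which is what lets the model pick up the "follow the instruction" meta-skill.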

Step 4: Parameter-Efficient Methods (LoRA, QLoRA)

Full fine-tuning updates every parameter in the model. For a 70B-parameter model, this requires enormous GPU memory (280+ GB just for the model weights in FP32, plus gradients and optimizer states). **Parameter-efficient fine-tuning (PEFT)** methods make this tractable.

**LoRA (Low-Rank Adaptation):** Instead of updating the full weight matrix W (d×d), LoRA freezes W and adds a small trainable bypass: W + ΔW, where ΔW = A × B. A is (d×r) and B is (r×d), with r << d (typically r=8 to 64). Instead of training d² parameters per matrix, you train only 2×d×r parameters. For d=4096 and r=16, that is about 131K parameters instead of 16.8M per matrix, a **99%+ reduction.**

For a 7B model: full fine-tuning updates ~7B parameters. LoRA with r=16 updates ~10-50M parameters (0.1-0.7% of total), depending on which matrices receive adapters. Yet reported performance is often within 1-2% of full fine-tuning, although the gap depends on the task and the rank.

**QLoRA:** Combines LoRA with 4-bit quantization of the frozen base model. The base weights are stored in 4-bit precision (instead of 16-bit), cutting the memory for the weights by roughly 4x. The LoRA adapters are kept in 16-bit precision and are the only parameters trained. This lets you fine-tune a 65B model on a single 48GB GPU.

**Key insight:** The weight updates during fine-tuning tend to live in a **low-rank subspace**: you don't need to change every parameter, just the right directions in weight space.
W' = W_frozen + A × B,  where A ∈ ℝ^(d×r), B ∈ ℝ^(r×d), r << d
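To make the formula concrete, here is a small pure-Python sketch using the same notation (A is d×r, B is r×d). It checks the parameter arithmetic for one matrix and runs a toy forward pass. The helper names are hypothetical, and starting A at zero is an assumed initialization chosen so the adapted layer begins identical to the frozen one; real LoRA implementations differ in naming and details.

```python
def lora_param_counts(d, r):
    """Return (full, lora) trainable parameter counts for one d x d matrix."""
    return d * d, 2 * d * r


def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def lora_forward(W, A, B, x):
    """Compute (W + A x B) x without ever forming the d x d product A x B."""
    base = matvec(W, x)                  # frozen path: W x
    low_rank = matvec(A, matvec(B, x))   # trainable path: A (B x)
    return [b + l for b, l in zip(base, low_rank)]


full, lora = lora_param_counts(d=4096, r=16)
print(f"full: {full:,}  lora: {lora:,}  trainable fraction: {lora / full:.2%}")

# Tiny d=3, r=1 example with a frozen W.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [3.0, 0.0, 1.0]]
A_init = [[0.0], [0.0], [0.0]]      # d x r, zero at the start (assumption)
A_trained = [[1.0], [0.0], [0.0]]   # pretend training moved one entry
B = [[0.5, -1.0, 2.0]]              # r x d
x = [1.0, 2.0, 3.0]

print(lora_forward(W, A_init, B, x))     # same as the frozen layer's output
print(lora_forward(W, A_trained, B, x))  # shifted by the low-rank update
```

Computing A (B x) rather than (A × B) x keeps the extra work proportional to d×r instead of d², which is why the bypass stays cheap as well as small.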

Step 5: Full vs Partial Fine-Tuning Tradeoffs

Choosing between full fine-tuning and PEFT methods involves several tradeoffs that depend on your specific situation.

**When full fine-tuning wins:**

  • You have enough GPUs (8+ A100s for a 70B model, and often more unless optimizer state is sharded or offloaded)
  • Maximum quality matters more than cost
  • The task requires deep behavioral changes (e.g., teaching a new language)
  • You need to modify knowledge stored deep in the model

**When LoRA/QLoRA wins:**

  • Limited hardware (1-2 consumer GPUs)
  • You need to fine-tune many models (one base + many LoRA adapters)
  • The task is about style/format, not deep knowledge changes
  • You want to quickly swap between specializations (LoRA adapters are tiny files)

**The adapter swapping advantage:** With LoRA, the base model stays frozen. You can create separate adapters for different tasks (medical, legal, creative) and swap them at inference time by loading different adapter files. Full fine-tuning offers no equivalent: you'd need a separate copy of the entire model for each task.

**Catastrophic forgetting risk:** Full fine-tuning has a higher risk of catastrophic forgetting (losing pre-trained knowledge) because all weights change. PEFT methods are inherently more conservative since most weights are frozen.
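The hardware side of this tradeoff can be estimated with back-of-the-envelope arithmetic. The sketch below rests on loudly stated assumptions: roughly 16 bytes per trained parameter for mixed-precision training with Adam (16-bit weights and gradients plus 32-bit master weights and two optimizer moments), a hypothetical adapter size of 100M parameters, and no accounting for activations or framework overhead. Treat the numbers as orders of magnitude, not measurements.

```python
GB = 1e9  # decimal gigabytes

# Assumption: mixed-precision Adam costs about 16 bytes per trained parameter.
TRAINED_BYTES_PER_PARAM = 16


def full_finetune_gb(n_params):
    """Weights, gradients, and optimizer state when every parameter is trained."""
    return n_params * TRAINED_BYTES_PER_PARAM / GB


def adapter_finetune_gb(n_params, adapter_params, base_bytes_per_param):
    """Frozen base weights plus the fully trained adapter parameters."""
    frozen = n_params * base_bytes_per_param
    trainable = adapter_params * TRAINED_BYTES_PER_PARAM
    return (frozen + trainable) / GB


N = 70_000_000_000       # 70B-parameter base model
ADAPTER = 100_000_000    # hypothetical adapter size, chosen for illustration

estimates = {
    "FP32 weights only": N * 4 / GB,
    "full fine-tuning": full_finetune_gb(N),
    "LoRA (16-bit base)": adapter_finetune_gb(N, ADAPTER, 2),
    "QLoRA (4-bit base)": adapter_finetune_gb(N, ADAPTER, 0.5),
}
for name, gb in estimates.items():
    print(f"{name:>20}: {gb:,.1f} GB")
```

Under these assumptions the frozen base weights dominate the LoRA and QLoRA totals, which is why quantizing them has such a large effect, while full fine-tuning is dominated by gradients and optimizer state.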

Step 6: When to Fine-Tune vs Prompt

Before investing time and compute in fine-tuning, ask: **can prompting solve this?**

**Try prompting first when:**

  • The task can be described in natural language instructions
  • You have fewer than 100 examples
  • You need quick iteration (minutes vs hours)
  • The task is well within the model's existing capabilities
  • You want flexibility to change behavior without retraining

**Fine-tune when:**

  • Prompting consistently fails or produces inconsistent results
  • You have 1,000+ high-quality examples
  • You need specific output formats the model struggles with
  • Latency matters (fine-tuned models don't need long system prompts)
  • You need to embed domain-specific knowledge or terminology
  • Cost matters at scale (shorter prompts = fewer tokens = less cost)

**The progression:** Most production systems follow this path:

  1. Start with prompting (zero-shot, then few-shot)
  2. If prompting isn't enough, try better prompts (chain-of-thought, structured prompting)
  3. If still insufficient, fine-tune with a small dataset
  4. If quality still needs improvement, collect more data and fine-tune again

**Cost comparison:** A fine-tuned model with a short prompt can be 3-10x cheaper per request than a base model with a long system prompt when both are billed at similar per-token rates, because you pay for every prompt token on every request. Check pricing first: hosted fine-tuned models are sometimes billed at a higher per-token rate, which narrows the gap.
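The cost comparison is simple per-token arithmetic, sketched below. Every number here is a hypothetical placeholder (token counts and prices are invented for illustration, and both models are assumed to share the same price), so substitute your own provider's rates and measured prompt lengths before drawing conclusions.

```python
def cost_per_request(prompt_tokens, output_tokens, price_in, price_out):
    """Dollar cost of one request; prices are in dollars per 1M tokens."""
    return (prompt_tokens * price_in + output_tokens * price_out) / 1_000_000


# Hypothetical prices, assumed identical for both models.
PRICE_IN, PRICE_OUT = 1.00, 2.00

USER_TOKENS = 100     # the user's actual question
OUTPUT_TOKENS = 150   # the model's answer

# Prompted model: a long system prompt with instructions and few-shot examples.
prompted = cost_per_request(2000 + USER_TOKENS, OUTPUT_TOKENS, PRICE_IN, PRICE_OUT)

# Fine-tuned model: behavior lives in the weights, so the system prompt is short.
fine_tuned = cost_per_request(50 + USER_TOKENS, OUTPUT_TOKENS, PRICE_IN, PRICE_OUT)

ratio = prompted / fine_tuned
print(f"prompted:   ${prompted:.5f} per request")
print(f"fine-tuned: ${fine_tuned:.5f} per request")
print(f"ratio:      {ratio:.1f}x")
```

With these placeholder numbers the prompted setup costs a bit over 5x more per request. The ratio shrinks quickly if outputs are long relative to the system prompt or if the fine-tuned model carries a higher per-token price.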


Prerequisites

  • Understanding of how LLMs generate text
  • Basic knowledge of training neural networks (loss, learning rate)

Key Concepts

  • Supervised Fine-Tuning (SFT)
  • Instruction Tuning
  • LoRA and QLoRA
  • Parameter-Efficient Fine-Tuning (PEFT)
  • Catastrophic Forgetting
  • Prompt Engineering vs Fine-Tuning