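Next-token prediction for "The cat sat on the ___": probability distribution over next words

| Candidate Word | Probability | Rationale |
| --- | --- | --- |
| mat | 0.18 | Common phrase "cat sat on the mat" |
| floor | 0.12 | Cats sit on floors frequently |
| couch | 0.09 | Another common surface for cats |
| roof | 0.04 | Possible but less common |
| elephant | 0.0001 | Grammatical, but semantically implausible here |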
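As a rough illustration of how a distribution like this one gets used, the sketch below picks the most likely word (greedy decoding) and also draws random samples in proportion to probability. The probabilities are copied from the table. Because they cover only five candidates, the remaining mass is lumped into a hypothetical `<other>` bucket, and the function names are illustrative rather than taken from any library.

```python
import random

# Probabilities copied from the table; they cover only five candidate words.
next_word_probs = {
    "mat": 0.18,
    "floor": 0.12,
    "couch": 0.09,
    "roof": 0.04,
    "elephant": 0.0001,
}

# Assumption: lump every word not listed into a single "<other>" bucket
# so that the distribution sums to 1.
next_word_probs["<other>"] = 1.0 - sum(next_word_probs.values())


def greedy_choice(probs):
    """Return the most probable listed word (ignoring the catch-all bucket)."""
    listed = {w: p for w, p in probs.items() if w != "<other>"}
    return max(listed, key=listed.get)


def sample_word(probs, rng):
    """Draw one word at random, weighted by its probability."""
    words = list(probs)
    weights = [probs[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so repeated runs give the same draws
print(greedy_choice(next_word_probs))
print([sample_word(next_word_probs, rng) for _ in range(5)])
```

Greedy decoding always returns "mat" here, while sampling occasionally produces the less likely words, which is one reason generated text can vary from run to run.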

What Is an LLM? — Lesson Content

Understand what large language models are, how they evolved from simple n-grams to transformer-based giants, and what they can (and cannot) do.

Large Language Models (LLMs) like GPT-4, Claude, and Llama have transformed AI. But what are they, really? At their core, they predict the next word — and this simple objective, applied at massive scale, produces models that can write, reason, code, and converse. This lesson traces the evolution from n-gram counting to neural language models to today's billion-parameter transformers. You'll learn what makes them "large," how they build contextual understanding, what they're capable of, and — critically — where they fall short, including the hallucination problem.

Learning Objectives

Define what a language model does at a fundamental level
Trace the evolution from n-grams to transformer-based LLMs
Explain the three dimensions of scale that make an LLM "large"
Describe how LLMs build contextual understanding through self-attention
List key LLM capabilities and their critical limitations

Step 1: What Is a Language Model?

A **language model** is a system that assigns probabilities to sequences of words. At its heart, it answers one question:

> **Given the words so far, what word comes next?**

For example, given "The cat sat on the ___", a good language model would assign high probability to words like "mat," "floor," or "couch" and low probability to words like "elephant" or "algorithm."

This simple idea, predicting the next word, turns out to be extraordinarily powerful. A model trained to predict the next word across billions of sentences implicitly learns grammar, facts, reasoning patterns, and even common sense.

**Formally:** A language model estimates P(w_t | w_1, w_2, ..., w_{t-1}), the probability of the next token given all previous tokens. The better the model estimates these probabilities, the better it "understands" language.

Every chatbot, autocomplete system, and text generator you use is built on this foundation.
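The per-token probabilities combine into a probability for a whole sequence through the chain rule of probability. Using the same notation as the formal definition, with T denoting the sequence length (a symbol introduced here for illustration):

```latex
P(w_1, w_2, \ldots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, w_2, \ldots, w_{t-1})
```

For t = 1 the conditioning context is empty, so the first factor is simply P(w_1).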

Step 2: From N-grams to Neural Language Models

Language modeling has a long history. Each era brought a fundamentally new approach:

**N-gram models (1990s-2000s):** Count word sequences in a corpus. A bigram model estimates P("mat" | "the") by counting how often "the mat" appears relative to all "the X" patterns. Simple but limited: they condition on only the last few words (n is typically 2 to 5) and struggle with rare combinations.

**Neural LMs (2003-2017):** The neural language model of Bengio and colleagues (2003) mapped words to continuous vectors (embeddings) and used a feed-forward network to predict the next word. This was revolutionary: similar words got similar vectors, so the model could generalize. RNN-based models (LSTM, GRU) extended this to handle variable-length context.

**Transformer LMs (2017+):** The transformer architecture replaced RNNs with self-attention, enabling massive parallelism and better handling of long-range dependencies. This unlocked the ability to train on vastly more data.

**Large Language Models (2018+):** GPT, BERT, and their successors showed that scaling transformers to billions of parameters on internet-scale data produces models with remarkable capabilities, from answering questions to writing code.

Each step was not just an incremental improvement but a qualitative leap in what language models could do.
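The counting idea behind n-gram models can be sketched in a few lines. This is a minimal illustration on a made-up three-sentence corpus, with no smoothing. The corpus and the `bigram_prob` helper are assumptions introduced here, not part of any particular toolkit.

```python
from collections import Counter, defaultdict

# Tiny toy corpus (made up for illustration).
corpus = [
    "the cat sat on the mat",
    "the dog sat on the floor",
    "the cat lay on the mat",
]

# bigram_counts[prev][nxt] = number of times `nxt` follows `prev`.
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        bigram_counts[prev][nxt] += 1


def bigram_prob(prev, nxt):
    """Estimate P(nxt | prev) as count(prev nxt) / count(prev followed by anything)."""
    total = sum(bigram_counts[prev].values())
    if total == 0:
        return 0.0  # `prev` was never seen with a following word
    return bigram_counts[prev][nxt] / total


print(bigram_prob("the", "mat"))   # 2 of the 6 words following "the" are "mat"
print(bigram_prob("the", "roof"))  # never observed, so the estimate is 0.0
```

The zero for "the roof" is the rare-combination problem in miniature: any word pair absent from the corpus gets probability 0 unless some form of smoothing is added.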

Step 3: The "Large" in LLM

What makes a language model "large"? Three dimensions of scale work together:

**1. Parameters:** The weights of the neural network. More parameters mean more capacity to store patterns. GPT-1 had 117 million parameters. GPT-4 has an estimated 1.8 trillion, roughly a 15,000x increase in five years (2018 to 2023).

**2. Training data:** LLMs are trained on enormous text corpora, often terabytes of web pages, books, code, and scientific papers. GPT-3 was trained on ~570 GB of text. More recent models train on multi-trillion-token datasets.

**3. Compute:** Training an LLM requires massive GPU/TPU clusters running for weeks or months. GPT-4's training cost is estimated at over $100 million. This compute budget is what enables learning from so much data.

The key insight is that these three dimensions interact: more parameters need more data to train effectively, and more data needs more compute to process. Getting the balance right is crucial; this is the domain of **scaling laws** (covered in a later lesson).

When people say "large" language model, they typically mean models with billions or more parameters, trained on internet-scale data with substantial compute budgets.
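To make the parameter figures concrete, the short calculation below reproduces the growth factor and estimates how much storage the weights alone would need. The 2 bytes per parameter (16-bit floats) is an assumption made for illustration, and the GPT-4 count is the unofficial estimate quoted above rather than a confirmed number.

```python
# Parameter counts quoted in the text (the GPT-4 figure is an estimate).
gpt1_params = 117e6
gpt4_params_estimate = 1.8e12

growth = gpt4_params_estimate / gpt1_params
print(f"Parameter growth: about {growth:,.0f}x")

# Assumption: each parameter is stored as a 16-bit float (2 bytes).
BYTES_PER_PARAM = 2


def weight_storage_gb(num_params, bytes_per_param=BYTES_PER_PARAM):
    """Approximate storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


print(f"GPT-1 weights: about {weight_storage_gb(gpt1_params):.2f} GB")
print(f"1.8T-parameter model: about {weight_storage_gb(gpt4_params_estimate):,.0f} GB")
```

Under these assumptions the larger model's weights run to thousands of gigabytes, far more than a single accelerator holds, which is part of why training and serving at this scale require clusters.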

Step 4: How LLMs Understand Context

LLMs don't truly "understand" language the way humans do, but they build remarkably rich internal representations. Here's how:

**Contextual embeddings:** Unlike older models that assign each word a single fixed vector, LLMs compute a **different representation for each word depending on its context**. The word "bank" gets a completely different internal representation in "river bank" versus "bank account." This is powered by the self-attention mechanism in transformers.

**Layered abstraction:** An LLM processes text through many transformer layers (e.g., 96 layers in GPT-3). Research suggests that, roughly:

- **Early layers** capture syntax: word order, grammar, part-of-speech
- **Middle layers** capture semantics: word meaning, entity types, relationships
- **Late layers** capture task-specific patterns: sentiment, intent, factual recall

**Long-range dependencies:** Self-attention lets each token attend directly to other tokens in the context window (in GPT-style models, to all earlier tokens). "She" in sentence 20 can directly reference "Dr. Smith" in sentence 1, with no information bottleneck in between.

**The context window** is the maximum number of tokens the model can "see" at once. GPT-3 had a context window of 2,048 tokens (~1,500 words). Modern models support 128K+ tokens (~100,000 words), enough for entire books.
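The core computation behind self-attention can be sketched without any libraries. The example below is a deliberately simplified version of scaled dot-product attention: it uses hand-picked 2-dimensional vectors, reuses the same vectors as queries, keys, and values, and omits the learned projection matrices, multiple heads, and masking that real transformers use. All names and numbers are assumptions chosen for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this token's query to every token's key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Output = weighted average of all value vectors.
        mixed = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical 2-dimensional vectors standing in for three tokens.
vectors = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
# Simplification: use the same vectors as queries, keys, and values.
outputs, weights = self_attention(vectors, vectors, vectors)
print(weights[1])  # how strongly the second token attends to each token
print(outputs[1])  # its new, context-dependent representation
```

Each output vector is a blend of every token's vector, weighted by similarity, so the same input word ends up with a different representation when its neighbors change.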

Step 5: What LLMs Can Do

Despite being trained on a single objective, predicting the next token, LLMs develop a surprisingly broad set of capabilities:

**Text generation:** Write essays, emails, stories, poems, and articles in a wide range of styles and tones.

**Question answering:** Answer factual questions by drawing on knowledge absorbed during training. Strong models match or exceed typical human scores on some standardized tests.

**Code generation:** Write, debug, and explain code in dozens of programming languages. Models like GPT-4 and Claude can pass coding interviews and solve competitive programming problems.

**Reasoning:** Solve math problems, logic puzzles, and multi-step reasoning tasks, especially when prompted to "think step by step" (chain-of-thought prompting).

**Translation:** Translate between languages, often rivaling specialized translation systems.

**Summarization:** Condense long documents into concise summaries while preserving key information.

**In-context learning:** Perhaps the most remarkable capability: LLMs can learn new tasks from just a few examples provided in the prompt, without any weight updates. This "few-shot" ability was a key surprise of GPT-3.

These capabilities **emerge** from scale: small models do few of these well, but at sufficient size the abilities appear almost spontaneously.
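In-context learning is easiest to see in the shape of the prompt itself. The sketch below assembles a few-shot prompt for a sentiment task. The task, the example reviews, and the `build_few_shot_prompt` helper are all made up for illustration, and no model is actually called. The resulting string is what would be sent to whichever LLM is in use.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an instruction, worked examples, and a new query into one prompt."""
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    # The final label is left blank for the model to fill in.
    lines.append(f"Review: {query}")
    lines.append("Sentiment:")
    return "\n".join(lines)


# Made-up demonstration examples.
examples = [
    ("I loved every minute of it.", "positive"),
    ("The plot was dull and predictable.", "negative"),
]

prompt = build_few_shot_prompt(
    "Classify the sentiment of each review as positive or negative.",
    examples,
    "A surprisingly moving film.",
)
print(prompt)
```

The model's weights never change. The two labeled examples simply make "positive" or "negative" the most probable continuation after the final "Sentiment:".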

Step 6: Limitations & Hallucinations

LLMs are powerful but have critical limitations that every user should understand:

**Hallucinations:** LLMs confidently generate text that sounds plausible but is factually wrong. They might invent citations, misstate dates, or fabricate events. This happens because the model is optimizing for "what sounds like a likely next token," not for truth. The model has no internal fact-checker.

**No real-time knowledge:** LLMs only know what was in their training data. They have a **knowledge cutoff** and cannot access current events, live data, or personal information unless provided in the prompt.

**Reasoning failures:** While LLMs can mimic reasoning, they sometimes fail on problems requiring precise logic, counting, or arithmetic, tasks that are trivial for a calculator. They can also be fooled by superficially similar problems with different solutions.

**Lack of true understanding:** LLMs manipulate statistical patterns in text. Whether they "understand" anything is debated, but they clearly lack human-like grounding: they've never seen, touched, or experienced the world they describe.

**Bias:** LLMs absorb biases present in their training data, including gender stereotypes, cultural biases, and more. Significant research effort goes into mitigation, but it remains an open problem.

Awareness of these limitations is essential for using LLMs responsibly.
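Because arithmetic is trivial for ordinary code, one practical habit is to verify a model's numeric claims independently instead of trusting them. The sketch below checks statements of the form "A + B = C" with exact integer arithmetic. The claim strings are hand-written stand-ins for model output, and `check_addition_claim` is a hypothetical helper, not part of any library.

```python
import re


def check_addition_claim(claim):
    """Check a claim of the form 'A + B = C' with exact integer arithmetic.

    Returns True or False for a well-formed claim, or None if the text
    does not match the expected pattern.
    """
    match = re.fullmatch(r"\s*(\d+)\s*\+\s*(\d+)\s*=\s*(\d+)\s*", claim)
    if match is None:
        return None
    a, b, claimed = (int(g) for g in match.groups())
    return a + b == claimed


# Hypothetical model outputs, written by hand for illustration.
print(check_addition_claim("4821 + 3997 = 8818"))  # True
print(check_addition_claim("4821 + 3997 = 8808"))  # False
```

The second claim looks just as plausible as the first, which is exactly why fluent output is not evidence of correctness.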

Step 7: Test Your Understanding

You've learned what language models are, how they evolved, what makes them "large," and their capabilities and limitations. Let's check your understanding!

Prerequisites

No prior AI knowledge required

Key Concepts

Language Modeling
Next-Token Prediction
N-grams vs Neural Models
Scale (Parameters, Data, Compute)
Contextual Embeddings
Capabilities & Limitations
Hallucinations
