Query, Key & Value Projections — Lesson Content
Learn how three learned projections give each token a search query, an index entry, and content to share — the foundation of attention scoring.
Self-attention needs each token to play three roles simultaneously: searching for relevant context, advertising what it contains, and providing content when attended to. Query, Key, and Value projections solve this with three separate learned linear transformations.
Using "The cat sat down" with 8-dimensional embeddings and concrete weight matrices, you'll see how Q, K, and V are computed, why three projections are better than one, and how they connect to the library analogy for information retrieval.
Learning Objectives
- Explain why three separate projections are needed
- Describe the role of Query, Key, and Value vectors
- Understand how Q, K, V are computed via matrix multiplication
- Connect Q, K, V to the library / information retrieval analogy
Step 1: Recap: From Words to Vectors
Before diving into Query, Key, and Value, let's recall where we are. Each word in "The cat sat down" has been converted to an 8-dimensional embedding vector and combined with positional encodings.
These input vectors contain information about:
- **What** the word means (from the learned embedding)
- **Where** the word sits in the sequence (from the positional encoding)
But these vectors don't yet know about **context**. The embedding for "cat" is the same whether the sentence is "The cat sat down" or "The cat chased the mouse." Self-attention will fix this by letting each token gather information from all other tokens.
The question is: how exactly does a token decide what to look for and where to find it? The answer is three learned projections: **Query**, **Key**, and **Value**.
Step 2: Why Three Separate Projections?
A naive approach to attention would compare input vectors directly: compute the similarity between each pair of input vectors and use that as the attention weight. But this has a fundamental limitation.
Consider the word "sat": it needs to **look for** its subject (a noun) but simultaneously **advertise** that it's a verb (so other words can find it). These are two different roles:
- **As a query:** "sat" is searching for nouns and modifiers — "Who sat? How?"
- **As a key:** "sat" is advertising "I'm an action verb in past tense"
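One way to see the limitation is to compute the naive scores directly. The sketch below uses random numbers as a stand-in for the lesson's input matrix (the values are an assumption, not the lesson's actual embeddings) and shows that raw dot-product similarity is always symmetric: "sat" scores "cat" exactly as highly as "cat" scores "sat", so the two roles cannot differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the 4 x 8 input matrix
# (embedding + positional encoding for "The cat sat down").
X = rng.normal(size=(4, 8))

# Naive attention scores: compare the raw input vectors directly.
naive_scores = X @ X.T  # shape (4, 4)

# The score matrix equals its own transpose: score(i, j) == score(j, i).
is_symmetric = np.allclose(naive_scores, naive_scores.T)
print(naive_scores.shape, is_symmetric)
```

Because the same vector is used on both sides of the dot product, there is no way for a token to search for one thing while advertising another.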
A single vector can't optimally serve both purposes. The solution is to create **three different views** of each token using three separate learned weight matrices:
| Projection | Purpose | Analogy |
|---|---|---|
| **Query (Q)** | What am I looking for? | A search query |
| **Key (K)** | What do I contain? | A document's index entry |
| **Value (V)** | What do I provide? | A document's actual content |
Each projection is a learned linear transformation: Q = X × W_Q^T, K = X × W_K^T, V = X × W_V^T. The network learns what makes a good query, key, and value during training.
Step 3: Query: "What Am I Looking For?"
The **Query** vector represents what information a token is searching for. It's computed by multiplying the input vector by the learned weight matrix W_Q:
**Q = X × W_Q^T**
For each token, the query encodes a kind of "search request." The network learns W_Q during training so that:
- Verbs generate queries that search for subjects and objects
- Adjectives generate queries that search for the nouns they modify
- Pronouns generate queries that search for their antecedents
For our sentence "The cat sat down" with 8-dimensional vectors, W_Q is an 8×8 matrix, and each token's query is an 8-dimensional vector.
The Query matrix Q has shape 4 tokens × 8 dimensions. Each row is one token's query vector: its "search request," which is compared against the Key of every token in the sequence.
The exact values are hard to interpret individually, but their **dot products with Key vectors** will determine the attention pattern. That's the magic: queries and keys are learned to produce meaningful similarity scores when dotted together.
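As a minimal sketch of the query projection, the snippet below uses random placeholders for X and W_Q (the lesson's concrete values are not reproduced here, so these numbers are assumptions). It checks that projecting a single token's input vector gives the same result as reading that token's row out of the full Q matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = ["The", "cat", "sat", "down"]

X = rng.normal(size=(4, 8))    # placeholder inputs, not the lesson's values
W_Q = rng.normal(size=(8, 8))  # placeholder for the learned query weights

# All four queries at once: Q = X x W_Q^T, shape (4, 8).
Q = X @ W_Q.T

# The query for one token is that token's input vector projected on its own.
sat = tokens.index("sat")
q_sat = X[sat] @ W_Q.T         # shape (8,)

print(Q.shape, q_sat.shape)
```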
Step 4: Key: "What Do I Contain?"
The **Key** vector represents what information a token has to offer. It's computed with a separate weight matrix W_K:
**K = X × W_K^T**
While the Query asks "what am I looking for?", the Key answers "what do I have?" The attention score between two tokens is the dot product of one token's Query and the other's Key: the higher the dot product, the more relevant the second token is to the first.
The network learns W_K so that tokens advertise themselves in a way that's compatible with the queries from tokens that should attend to them. For example:
- "cat" should have a Key that matches well with verb Queries (since verbs look for their subjects)
- "down" should have a Key that matches well with the Query of the verb it modifies
This learning happens automatically through backpropagation — the network adjusts W_Q and W_K together so that the resulting attention patterns help minimize the training loss.
Think of Keys as **index entries in a library catalog**. Each book (token) has an index entry describing its contents, and your search query is matched against these entries to find the most relevant books.
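The Query-Key matching can be sketched as a single matrix product. The weights below are random placeholders (an assumption for illustration), so the scores themselves carry no linguistic meaning, but the shapes and indexing match the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

X = rng.normal(size=(4, 8))    # placeholder inputs
W_Q = rng.normal(size=(8, 8))  # placeholder query weights
W_K = rng.normal(size=(8, 8))  # placeholder key weights

Q = X @ W_Q.T  # (4, 8)
K = X @ W_K.T  # (4, 8)

# scores[i, j] = query of token i dotted with key of token j.
scores = Q @ K.T  # (4, 4)

# Unlike raw input similarity, the scores need not be symmetric:
# how strongly "sat" looks at "cat" can differ from the reverse.
print(scores[2, 1], scores[1, 2])
```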
Step 5: Value: "What Do I Provide?"
The **Value** vector is the actual content that gets passed along when a token is attended to. It's computed with a third weight matrix W_V:
**V = X × W_V^T**
Here's the crucial distinction:
- **Query and Key** determine **how much** attention to pay (the attention weights)
- **Value** determines **what information** gets passed (the attention output)
This separation is powerful. A token's Key might advertise "I'm a noun," which helps verbs find it. But the Value it provides might encode rich semantic information like "animate entity, domestic animal, subject of sentence" — much more than what the Key advertised.
Continuing the library analogy:
- You search using the **index** (Q·K matching) to find relevant books
- But you read the **actual content** (V) of those books, not the index entries
The output for each token is a weighted sum of all Value vectors, where the weights come from the Query-Key matching. This is how context flows between tokens.
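The weighted sum can be sketched as follows. This section has not yet said how raw scores become weights, so the sketch assumes the standard scaled dot-product choice: divide by the square root of the key dimension, then apply a softmax over each row. All weight values are random placeholders.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))  # placeholder inputs
W_Q, W_K, W_V = (rng.normal(size=(8, 8)) for _ in range(3))

Q, K, V = X @ W_Q.T, X @ W_K.T, X @ W_V.T

# Assumed normalisation: scale by sqrt(d_k), softmax over each row.
weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # (4, 4), rows sum to 1

# Each output row is a weighted sum of all four Value vectors.
output = weights @ V  # (4, 8)
print(output.shape)
```

Note that Q and K only appear inside `weights`; the content that ends up in `output` comes entirely from V.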
Step 6: Computing Q, K, V with Matrix Multiplication
Now let's see the actual computation. Given our input matrix X (4 tokens × 8 dims) and three weight matrices W_Q, W_K, W_V (each 8×8), we compute:
**Q = X × W_Q^T** ((4×8) × (8×8) = (4×8))
**K = X × W_K^T** ((4×8) × (8×8) = (4×8))
**V = X × W_V^T** ((4×8) × (8×8) = (4×8))
Each is a simple matrix multiplication. The weight matrices are **learned parameters** — they start random and are refined during training through backpropagation.
The three projections don't depend on each other, so they can be computed in parallel. Each one also handles all four tokens in a single matrix multiplication, rather than stepping through the sequence one token at a time as an RNN does.
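Because the projections are independent, one common way to compute them together is to stack the three weight matrices and perform a single matrix multiplication. The sketch below is an illustration with random placeholder weights; the name `W_QKV` is hypothetical and is not taken from the lesson.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))  # placeholder inputs
W_Q, W_K, W_V = (rng.normal(size=(8, 8)) for _ in range(3))

# Three separate projections, as in the formulas above.
Q = X @ W_Q.T
K = X @ W_K.T
V = X @ W_V.T

# Fused version: stack the weights along the output axis, multiply once.
W_QKV = np.concatenate([W_Q, W_K, W_V], axis=0)  # (24, 8)
QKV = X @ W_QKV.T                                # (4, 24)
Q_f, K_f, V_f = np.split(QKV, 3, axis=1)         # three (4, 8) blocks

print(np.allclose(Q, Q_f), np.allclose(K, K_f), np.allclose(V, V_f))
```

The fused and separate versions give the same Q, K, and V; the fused form simply does the work in one larger multiplication.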
**Dimensions recap:**
- Input X: 4 tokens × 8 dimensions
- Each W matrix: 8 × 8 (output_dim × input_dim, since the formulas multiply by the transpose W^T)
- Each output (Q, K, V): 4 tokens × 8 dimensions
In practice, the output dimension can differ from the input dimension. In multi-head attention (covered in a later lesson), each head typically works with Q, K, and V vectors of a smaller dimension than the input.
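To illustrate a projection whose output dimension differs from the input dimension, the sketch below projects the 8-dimensional inputs down to 4 dimensions. The value `d_k = 4` is an arbitrary choice for illustration, and the weights are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 8, 4  # d_k = 4 is an illustrative assumption

X = rng.normal(size=(4, d_model))      # placeholder inputs
W_Q = rng.normal(size=(d_k, d_model))  # (output_dim, input_dim)
W_K = rng.normal(size=(d_k, d_model))
W_V = rng.normal(size=(d_k, d_model))

Q = X @ W_Q.T  # (4, 4): 4 tokens, 4 dims each
K = X @ W_K.T  # (4, 4)
V = X @ W_V.T  # (4, 4)

# There is still exactly one score per pair of tokens.
scores = Q @ K.T  # (4, 4)
print(Q.shape, K.shape, V.shape, scores.shape)
```

Q and K must share the same dimension so that their dot product is defined; V could in principle use a different one.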
Step 7: Test Your Understanding
You've learned how Query, Key, and Value projections give each token three purpose-specific representations. Let's test your understanding!
Prerequisites
- Self-attention intuition
- Matrix multiplication basics
- Word embeddings
Key Concepts
- Query Projection
- Key Projection
- Value Projection
- Weight Matrices
- Linear Transformation
- Information Retrieval Analogy