Query, Key, Value

The Three Projections That Drive Attention

Difficulty: Intermediate
Duration: 12-15 min
Prerequisites: Self-Attention
Step: 1 / 7

Recap: From Words to Vectors

Before diving into Query, Key, and Value, let's recall where we are. Each word in "The cat sat down" has been converted to an 8-dimensional embedding vector and combined with positional encodings.

These input vectors contain information about:

  • What the word means (from the learned embedding)
  • Where the word sits in the sequence (from the positional encoding)

But these vectors don't yet know about context. The embedding for "cat" is the same whether the sentence is "The cat sat down" or "The cat chased the mouse." Self-attention will fix this by letting each token gather information from all other tokens.

The question is: how exactly does a token decide what to look for and where to find it? The answer is three learned projections: Query, Key, and Value.
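To make the three projections concrete, here is a minimal NumPy sketch. The weight matrices are random stand-ins for what a real model learns during training, and the dimensions (4 tokens, 8-dimensional vectors) match the running example; the specific numbers are hypothetical.

```python
import numpy as np

d_model, d_k = 8, 8
rng = np.random.default_rng(0)
X = rng.normal(size=(4, d_model))      # 4 input vectors (embedding + positional)

# Three learned projection matrices (random here; learned in a real model)
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

# Each token gets a query, a key, and a value vector
Q, K, V = X @ W_Q, X @ W_K, X @ W_V    # each (4, d_k)

# Scaled dot-product attention: queries score against keys,
# and the scores mix the value vectors
scores = Q @ K.T / np.sqrt(d_k)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
output = weights @ V                   # context-aware vectors, (4, d_k)
```

Each row of `weights` sums to 1, so each output vector is a weighted average of the value vectors of all tokens, which is exactly how a token gathers information from the rest of the sequence.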

Input Vectors: Embedding + Positional Encoding

Token   Position   Input Vector (first 4 dims)
The     0          [0.12, 0.66, 0.56, 1.78, ...]
cat     1          [1.75, 0.42, 0.44, 0.44, ...]
sat     2          [1.36, 0.25, -0.69, 1.10, ...]
down    3          [-0.20, -0.43, 1.08, 0.84, ...]

What Each Input Vector Contains

Component             Source                   Encodes
Embedding             Learned lookup table     Semantic meaning of the word
Positional Encoding   Sinusoidal function      Position in the sequence
Input Vector          Embedding + Positional   Both meaning AND position
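The components in the table above can be sketched in NumPy. The embeddings here are random placeholders (a real model would look them up in a trained table), while the positional encoding follows the standard sinusoidal formula; the values will not match the example table above.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    positions = np.arange(seq_len)[:, None]      # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]     # (1, d_model/2)
    angles = positions / (10000 ** (dims / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(4, 8))             # placeholder for "The cat sat down"
inputs = embeddings + sinusoidal_positional_encoding(4, 8)
```

The element-wise sum is the key design choice: both signals live in the same 8-dimensional space, so the result carries meaning and position at once.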