Softmax & Attention Output
From Scores to Context-Rich Representations
Dot Product Scoring: Q · K^T
Now that we have Query and Key matrices, how do we compute attention scores? The answer is the dot product: a simple, cheap-to-compute measure of similarity between two vectors.
For each pair of tokens (i, j), the raw attention score is:
score(i, j) = Q_i · K_j = Σ_d Q_i[d] × K_j[d]
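To make this concrete, here is a minimal NumPy sketch with made-up 4-dimensional vectors (both the values and the dimension are illustrative, not taken from the example below):

```python
import numpy as np

# Made-up 4-dimensional Query and Key vectors for tokens i and j
q_i = np.array([0.2, -0.1, 0.5, 0.3])
k_j = np.array([0.4,  0.0, 0.6, 0.1])

# score(i, j): multiply dimension by dimension, then sum
score = np.dot(q_i, k_j)
print(score)  # ≈ 0.41  (0.2*0.4 + (-0.1)*0.0 + 0.5*0.6 + 0.3*0.1)
```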
In matrix form, we compute all scores at once:
Scores = Q × K^T
This is a single matrix multiplication that, for our four-token example, produces a 4×4 matrix: one score for every pair of tokens. The score at position (i, j) tells us how much token i's Query matches token j's Key.
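In code, computing every score is literally one matrix multiply. A minimal sketch, assuming random Q and K matrices and an illustrative head dimension of 8:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8                   # 4 tokens; head dimension of 8 is assumed
Q = rng.normal(size=(seq_len, d_k))   # one Query row per token
K = rng.normal(size=(seq_len, d_k))   # one Key row per token

scores = Q @ K.T                      # (4, 8) @ (8, 4) -> (4, 4)
print(scores.shape)                   # (4, 4): one score per (query, key) pair
```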
Why dot product works: If two vectors point in similar directions, their dot product is large and positive. If they're orthogonal (unrelated), it's near zero. If they point in opposite directions, it's negative. The learned W_Q and W_K matrices are trained so that relevant Query-Key pairs produce high dot products.
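A quick numeric check of those three cases:

```python
import numpy as np

q = np.array([1.0, 0.0])
print(np.dot(q, np.array([0.9, 0.1])))   #  0.9: similar direction, large positive
print(np.dot(q, np.array([0.0, 1.0])))   #  0.0: orthogonal, unrelated
print(np.dot(q, np.array([-1.0, 0.0])))  # -1.0: opposite direction, negative
```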
Below are the scores for our example sentence "The cat sat down", already divided by √d_k (we'll explain this scaling next).
Scaled Attention Scores (4×4)
Scaled Scores = Q·K^T / √d_k
Score Matrix: How Well Each Query Matches Each Key
| Query Token | Key: The | Key: cat | Key: sat | Key: down |
|---|---|---|---|---|
| The | 0.226 | 0.827 | 0.029 | 0.630 |
| cat | 0.413 | 0.820 | 0.094 | 0.587 |
| sat | 0.847 | 0.349 | -0.078 | 0.955 |
| down | -0.070 | 0.648 | 0.056 | 0.200 |
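As a preview of the softmax step named in the section title, here is a minimal sketch that turns each row of the table above into attention weights (the row-max subtraction is the standard numerical-stability trick, not something specific to this example):

```python
import numpy as np

# Scaled scores copied from the table above (rows = Queries, cols = Keys)
scores = np.array([
    [ 0.226, 0.827,  0.029, 0.630],   # The
    [ 0.413, 0.820,  0.094, 0.587],   # cat
    [ 0.847, 0.349, -0.078, 0.955],   # sat
    [-0.070, 0.648,  0.056, 0.200],   # down
])

# Row-wise softmax: each Query's scores become positive weights summing to 1
shifted = scores - scores.max(axis=1, keepdims=True)
weights = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
print(weights.sum(axis=1))  # [1. 1. 1. 1.]
```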