Attention resolves through DNS

Author: Sunil Tiwari (@sunil28071987)
The attention mechanism is one of the core ideas behind Transformers and modern language models. At first, it looks like a scary pile of symbols:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
But once you break it down, the whole thing is just three steps. Here's the mnemonic:
DNS — Dot, Normalize, Sum
D = Dot product — compute similarity scores between query and keys.
N = Normalize — apply softmax to turn scores into attention weights.
S = Sum — take the weighted sum of values.
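So, for a single token $i$:
$$o_i = \sum_j \text{softmax}_j\left(\frac{q_i \cdot k_j}{\sqrt{d_k}}\right) v_j$$
Or, written in matrices:
$$O = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$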
That's the skeleton. Attention resolves through DNS. Now let's build the whole thing carefully, from token embeddings all the way to the output.
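As a rough sketch of the mnemonic, here are the three steps for a single query vector in plain Python. The function names and the toy vectors are made up for illustration, and the division by $\sqrt{d_k}$ is the scaling step the article motivates later.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dns_attention(query, keys, values):
    """Attention output for one query vector: Dot, Normalize, Sum."""
    d_k = len(query)
    # D: dot product of the query with every key, scaled by sqrt(d_k)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    # N: softmax turns the scores into attention weights
    weights = softmax(scores)
    # S: weighted sum of the value vectors, one output dimension at a time
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Hypothetical toy data: two keys, and the query is aligned with the first one.
weights, output = dns_attention(
    query=[4.0, 0.0],
    keys=[[4.0, 0.0], [0.0, 4.0]],
    values=[[1.0, 0.0], [0.0, 1.0]],
)
print(weights, output)
```

Because the query lines up with the first key, almost all of the weight lands on the first value vector.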
1. Start with tokens
Take a sentence:
The cat drank milk because it was thirsty.
The model first splits this sentence into tokens. For simplicity, pretend each word is one token:
The, cat, drank, milk, because, it, was, thirsty
There are 8 tokens.
A Transformer cannot directly understand words as text. It needs numbers. So each token is converted into a vector called an embedding.
An embedding is a list of numbers representing a token. For example, a tiny fake embedding for cat might be (values made up for illustration):
$$x_{\text{cat}} = [0.2, -1.3, 0.7, 0.5]$$
Real embeddings are much larger: 512, 768, 4096, or even more dimensions depending on the model.
If we stack all token embeddings into a matrix, we get $X$. For 8 tokens, each with 4-dimensional embeddings:
$$X \in \mathbb{R}^{8 \times 4}$$
This means $X$ is a matrix with 8 rows and 4 columns. Rows = tokens. Columns = embedding dimensions.
Each row is a token vector.
2. What does $X \in \mathbb{R}^{8 \times 4}$ mean?
This notation is common in machine learning. Read it as:
X belongs to the set of real-number matrices with 8 rows and 4 columns.
Breaking it down:
- $\mathbb{R}$ means real numbers, such as $-2$, $0$, $0.5$, $\pi$, $100$.
- $\mathbb{R}^{8 \times 4}$ means all matrices with 8 rows and 4 columns, where each entry is a real number.
In attention:
- Rows usually represent tokens.
- Columns usually represent vector dimensions.
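So $Q \in \mathbb{R}^{3 \times 2}$ means $Q$ has 3 token rows, and each query vector has 2 dimensions.
This shape notation is not decoration. It tells you whether matrix multiplication is legal. For example:
$$\underbrace{Q}_{3 \times 2}\ \underbrace{K^\top}_{2 \times 3} \;\to\; 3 \times 3$$
works because the middle dimensions match ($2 = 2$).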
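The shape rule is mechanical enough to write down as a tiny check. This is a sketch with a made-up helper name, `matmul_shape`, that only looks at (rows, columns) pairs and never touches actual numbers.

```python
def matmul_shape(left, right):
    """Return the shape of left @ right, or raise if the product is undefined."""
    (rows_l, cols_l), (rows_r, cols_r) = left, right
    if cols_l != rows_r:
        raise ValueError(f"inner dimensions differ: {cols_l} != {rows_r}")
    return (rows_l, cols_r)

print(matmul_shape((3, 2), (2, 3)))   # 3 query rows against 3 transposed keys
print(matmul_shape((8, 8), (8, 64)))  # an 8x8 weight matrix times 8 value rows
```

Swapping in two (3, 2) shapes without the transpose raises the error, which is exactly the mistake the notation helps you catch on paper.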
3. Embeddings alone are not enough
A token embedding tells the model what the token roughly means:
- cat = animal, noun, living thing, singular, etc.
- milk = liquid, noun, drinkable, etc.
- it = pronoun
But there is a problem.
A basic embedding by itself does not tell the model where the word appears in the sentence. Compare:
Dog bites man.
and:
Man bites dog.
Same words. Different order. Very different meaning.
Without word order, the model would have a mess. It would know the words dog, bites, man but not who bites whom. That is why Transformers need positional information.
4. Positional encoding: telling the model word order
Attention by itself is permutation-blind. If you shuffle the tokens, raw attention does not naturally know the original order. So we add positional encoding.
Let $X$ be the token embedding matrix and $P$ be the positional encoding matrix. Then:
$$Z = X + P$$
Here:
- $X$ = token meaning
- $P$ = token position
- $Z$ = token meaning plus position
For example:
$$z_{\text{cat}} = x_{\text{cat}} + p_2$$
This means "cat plus I am at position 2." And:
$$z_{\text{it}} = x_{\text{it}} + p_6$$
This means "it plus I am at position 6."
So after positional encoding, the model does not just know cat — it knows something closer to cat at position 2. And not just it, but it at position 6. This matters enormously for grammar, reference, syntax, and meaning.
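The article does not say which positional scheme it has in mind, so the sketch below assumes the sinusoidal encoding from the original Transformer paper. The helper names are made up, and positions are counted from 0 in code rather than from 1 as in the prose.

```python
import math

def sinusoidal_position(pos, d_model):
    """Sinusoidal encoding for one position (one common choice, assumed here)."""
    vec = []
    for i in range(d_model):
        # dimensions 2k and 2k+1 share the frequency 1 / 10000^(2k / d_model)
        angle = pos / (10000 ** ((i - i % 2) / d_model))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec

def add_positions(X):
    """Z = X + P: add a position vector to every token embedding."""
    d_model = len(X[0])
    return [
        [x + p for x, p in zip(row, sinusoidal_position(pos, d_model))]
        for pos, row in enumerate(X)
    ]

# The same hypothetical embedding placed at two different positions.
same_token = [0.5, -0.2, 0.1, 0.9]
Z = add_positions([same_token, same_token])
print(Z[0])
print(Z[1])
```

The two rows start from identical embeddings but come out different, which is the whole point: after the addition, the same word at a different position is a different vector.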
5. Why positional encoding helps in pronoun resolution
Consider again:
The cat drank milk because it was thirsty.
The word it could refer to cat or milk. Both are nouns. Both came before it. But cat is more plausible because a cat can be thirsty.
However, meaning alone is not enough. The model also needs order and structure:
The(1) cat(2) drank(3) milk(4) because(5) it(6) was(7) thirsty(8)
Positional information helps the model learn patterns like:
- Pronouns often refer to earlier nouns.
- Nearby nouns may matter.
- Subjects often matter.
- Words after "because" explain a cause.
- Adjectives often describe nearby nouns/pronouns.
Without position, the model would struggle to know which word came first, which is subject-like, which is object-like, and which phrase belongs with which. Positional encoding gives attention the sentence's geometry.
Bluntly: attention tells the model who is relevant; positional encoding tells the model where everyone is standing in line.
6. From Z to Q, K, and V
Once we have position-aware token vectors $Z$, the model creates three new matrices:
$$Q = ZW_Q, \quad K = ZW_K, \quad V = ZW_V$$
These are:
- $Q$: Query
- $K$: Key
- $V$: Value
The matrices $W_Q$, $W_K$, and $W_V$ are learned during training. They are called projection matrices.
7. What does "projecting embeddings" mean?
Projection means: multiply a vector by a learned matrix to create a new version of the vector.
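For example, suppose one token vector is $z = [1, 2]$ and the model has learned this matrix (numbers chosen for illustration):
$$W = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}$$
Then:
$$zW = [1 \cdot 1 + 2 \cdot 1,\ 1 \cdot 0 + 2 \cdot 1] = [3, 2]$$
So the original vector $[1, 2]$ was transformed into $[3, 2]$. That is projection.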
In Transformers, we project the same input three different ways:
$$Q = ZW_Q, \quad K = ZW_K, \quad V = ZW_V$$
The model creates three different learned views of the same tokens.
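A minimal sketch of those three views, using small hand-written matrices in place of learned ones. Every number here is hypothetical; in a real model the projection matrices come out of training.

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# Hypothetical position-aware vectors Z for 3 tokens, 2 dimensions each.
Z = [[1.0, 0.0],
     [1.0, 1.0],
     [0.0, 1.0]]

# Hypothetical stand-ins for the learned projection matrices.
W_Q = [[1.0, 0.0], [0.0, 1.0]]   # identity: the query view equals the input
W_K = [[0.0, 1.0], [1.0, 0.0]]   # swaps the two dimensions
W_V = [[2.0, 0.0], [0.0, 0.5]]   # rescales each dimension

Q, K, V = matmul(Z, W_Q), matmul(Z, W_K), matmul(Z, W_V)
print(Q, K, V, sep="\n")
```

Same input rows, three different outputs. That is all "projecting into Q, K, V" means mechanically.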
8. Why divide embeddings into Q, K, and V?
This is one of the most important ideas.
A token embedding is general-purpose. It contains lots of mixed information: meaning, grammar, position, token identity, semantic role, contextual hints.
But attention needs three different jobs:
- What am I looking for?
- How should others find me?
- What information should I provide if selected?
That is why attention uses:
| Vector | Meaning | Job |
|---|---|---|
| $Q$ | Query | What am I looking for? |
| $K$ | Key | How should I be found? |
| $V$ | Value | What information do I provide? |
For the word it, $q_{\text{it}}$ might represent: "I am a pronoun looking for a referent."
For the word cat, $k_{\text{cat}}$ might represent: "I am a singular living noun and a possible referent."
For the word milk, $k_{\text{milk}}$ might represent: "I am a singular noun, object, liquid."
Then it compares its query with the keys of other tokens.
9. Why exactly three: Q, K, and V?
Because attention is basically key-value retrieval.
Think of a dictionary, database, or search engine. You have:
query → compare against keys → retrieve values
Example database:
- query: user_id = 42
- key: 42
- value: full user record
You do not retrieve the key itself. You use the key to find the value.
Attention works the same way:
- $Q$ is the search query.
- $K$ is the searchable address/index.
- $V$ is the content/payload.
This is why three is the natural minimum. You need one thing that searches, one thing that gets searched, and one thing that gets returned.
Could we use only two? Technically yes, but the result is weaker. If $K = V$, then the same vector must be used both for matching and for content. That creates a bottleneck: the features useful for being found are not always the features useful for what should be passed along.
Example:
The trophy does not fit in the suitcase because it is too large.
To resolve it, the model needs to decide whether it refers to trophy or suitcase. The key features may include: noun, singular, candidate referent, position. But the value information may include: object size, containment role, semantic plausibility. Matching and payload are related, but not identical. That is why $K$ and $V$ are separated.
Could there be 4 or 5 vectors? In principle, yes. But Q/K/V is the clean core structure. Multi-head attention already creates multiple Q/K/V sets in parallel, so the model gets many specialized attention mechanisms without adding random extra roles.
10. Q and K produce attention scores
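Once we have Q, K, and V, attention starts with a dot product:
$$QK^\top$$
This compares every query with every key.
Suppose we have 3 tokens: cat, drank, milk. And each token has a 2D query/key vector. Let:
$$Q = \begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 0 & 1 \end{bmatrix}, \quad K = \begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 0 & 1 \end{bmatrix}$$
Then:
$$K^\top = \begin{bmatrix} 1 & 1 & 0 \\ 0 & 1 & 1 \end{bmatrix}$$
Now:
$$QK^\top = \begin{bmatrix} 1 & 1 & 0 \\ 1 & 2 & 1 \\ 0 & 1 & 1 \end{bmatrix}$$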
This gives a score matrix. Rows are query tokens. Columns are key tokens.
| | cat | drank | milk |
|---|---|---|---|
| cat | 1 | 1 | 0 |
| drank | 1 | 2 | 1 |
| milk | 0 | 1 | 1 |
The row for drank is $[1, 2, 1]$. In this toy setup, that means drank matches cat with score 1, itself with score 2, and milk with score 1.
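To check the score table by hand, here is a short sketch. The 2D vectors are one hypothetical choice that reproduces the table above; other choices could give the same scores.

```python
# One hypothetical choice of 2D vectors that reproduces the score table above.
tokens = ["cat", "drank", "milk"]
Q = [[1, 0], [1, 1], [0, 1]]
K = [[1, 0], [1, 1], [0, 1]]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# scores[i][j] = q_i . k_j, which is entry (i, j) of Q K^T
scores = [[dot(q, k) for k in K] for q in Q]

for token, row in zip(tokens, scores):
    print(f"{token:>5}: {row}")
```

Each row of the result is one query token scored against every key token, matching the rows-are-queries, columns-are-keys reading of the table.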
11. Why dot product?
A dot product measures alignment/similarity between vectors. For two vectors $q$ and $k$, the score $q \cdot k$ is larger when they point in a similar direction, and smaller when they are less aligned.
So in attention, $q_i \cdot k_j$ asks: "How well does token $i$'s query match token $j$'s key?"
For the pronoun example:
- $q_{\text{it}} \cdot k_{\text{cat}}$ asks: "How compatible is 'it' with 'cat' as something to attend to?"
- $q_{\text{it}} \cdot k_{\text{milk}}$ asks: "How compatible is 'it' with 'milk' as something to attend to?"
When $q$ and $k$ point the same way, the dot product is large. When they are perpendicular, it is zero. When they point in opposite directions, it is negative.
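Those three cases are easy to verify with a few hypothetical 2D vectors:

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

q = [1.0, 2.0]
same = dot(q, [1.0, 2.0])         # key points the same way as the query
perpendicular = dot(q, [-2.0, 1.0])  # key is at a right angle to the query
opposite = dot(q, [-1.0, -2.0])   # key points the opposite way

print(same, perpendicular, opposite)  # 5.0 0.0 -5.0
```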
12. Why divide by $\sqrt{d_k}$?
The attention formula is not just $\text{softmax}(QK^\top)V$. It is:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
where $d_k$ is the dimension of the key/query vectors.
Why divide? Because dot products tend to grow in magnitude when vectors have more dimensions: if the components of $q$ and $k$ are independent with mean 0 and variance 1, then $q \cdot k$ has variance $d_k$. If $d_k$ is large, the dot-product scores can become huge, and then softmax becomes too sharp.
Example:
- softmax([1, 2, 3]) gives a distribution where the biggest value wins, but not insanely.
- softmax([10, 20, 30]) puts almost all weight on the largest value.
That is bad for training because most positions get nearly zero gradient. The model becomes overconfident too early.
So we scale by $\sqrt{d_k}$ to keep the score magnitudes controlled. Think of it as preventing softmax from becoming a drama queen.
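The two softmax examples can be run directly. The sketch below also applies the scaling with a hypothetical $d_k = 100$, picked only because $\sqrt{100} = 10$ maps the sharp scores back onto the mild ones.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

mild = softmax([1, 2, 3])
sharp = softmax([10, 20, 30])
print([round(w, 3) for w in mild])    # largest score wins, others keep a share
print([round(w, 5) for w in sharp])   # nearly all weight on the largest score

# Dividing by sqrt(d_k) pulls large scores back toward a milder range.
d_k = 100  # hypothetical key dimension
scaled = softmax([s / math.sqrt(d_k) for s in [10, 20, 30]])
print([round(w, 3) for w in scaled])  # same weights as `mild`
```

The ranking of the scores never changes. Only the sharpness does, and that is what the scaling controls.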
13. Softmax: turning scores into weights
After dot product and scaling, we apply softmax:
$$A = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)$$
Softmax converts raw scores into normalized weights. The weights have two important properties:
- Each weight is positive.
- Each row sums to 1.
So a row becomes a distribution over tokens.
For example, suppose the scores for it are:
| | cat | drank | milk |
|---|---|---|---|
| it | 5.2 | 0.7 | 1.8 |
After softmax:
| | cat | drank | milk |
|---|---|---|---|
| it | 0.96 | 0.01 | 0.03 |
This means it attends 96% to cat, 1% to drank, and 3% to milk. Now we have the attention matrix $A$, which tells us where each token looks.
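The numbers in the two tables can be reproduced in a few lines. The scores are the ones from the example; the helper is a standard softmax written out by hand.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

raw = {"cat": 5.2, "drank": 0.7, "milk": 1.8}   # scores for the query token "it"
weights = dict(zip(raw, softmax(list(raw.values()))))

for token, w in weights.items():
    print(f"{token:>5}: {w:.2f}")   # cat 0.96, drank 0.01, milk 0.03
```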
14. The attention matrix A
If the sentence has 8 tokens, then:
$$A \in \mathbb{R}^{8 \times 8}$$
Why? Because every token can attend to every token. Rows = the token doing the looking. Columns = the token being looked at.
So with tokens The, cat, drank, milk, because, it, was, thirsty, the row for it might be:
| | The | cat | drank | milk | because | it | was | thirsty |
|---|---|---|---|---|---|---|---|---|
| it | 0.00 | 0.91 | 0.01 | 0.04 | 0.01 | 0.01 | 0.01 | 0.01 |
This would mean the token it mostly attends to cat.
In reality, actual attention patterns are spread across many heads and layers, so one row does not always map cleanly to a human interpretation. But as a learning model, this is the right intuition.
Attention matrix (softmax-normalized weights). Rows = query (the token doing the looking). Columns = key (the token being looked at).
| | The | cat | drank | milk | because | it | was | thirsty |
|---|---|---|---|---|---|---|---|---|
| The | 6% | 64% | 6% | 5% | 5% | 5% | 5% | 5% |
| cat | 12% | 52% | 16% | 5% | 4% | 3% | 4% | 4% |
| drank | 1% | 53% | 7% | 32% | 1% | 2% | 2% | 2% |
| milk | 3% | 4% | 49% | 30% | 4% | 3% | 3% | 4% |
| because | 7% | 9% | 33% | 9% | 15% | 7% | 8% | 11% |
| it | 1% | 84% | 2% | 3% | 2% | 3% | 2% | 4% |
| was | 3% | 5% | 4% | 4% | 4% | 31% | 6% | 43% |
| thirsty | 2% | 34% | 3% | 3% | 3% | 42% | 8% | 6% |
15. Now the final step: weighted sum
After we get attention weights $A$, we use them to mix the value vectors $V$:
$$O = AV$$
This is the S in DNS: Dot → Normalize → Sum.
Here:
- $A$ tells us how much to take from each token.
- $V$ contains what each token can provide.
- $O$ is the resulting context-aware output.
For a single token $i$:
$$o_i = \sum_j a_{ij} v_j$$
Read this as: the output vector for token $i$ is the weighted sum of all value vectors $v_j$, weighted by how much token $i$ attends to token $j$.
This is the heart of attention.
16. Example of O = AV
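Suppose for it, attention weights are:
$$a_{\text{it}} = [0.96, 0.01, 0.03]$$
corresponding to cat, drank, milk. Now suppose (values chosen for illustration):
$$v_{\text{cat}} = [10, 0], \quad v_{\text{drank}} = [0, 10], \quad v_{\text{milk}} = [5, 5]$$
Stacked:
$$V = \begin{bmatrix} 10 & 0 \\ 0 & 10 \\ 5 & 5 \end{bmatrix}$$
Then:
$$o_{\text{it}} = 0.96\,[10, 0] + 0.01\,[0, 10] + 0.03\,[5, 5]$$
So $o_{\text{it}} = [9.75, 0.25]$.
Because cat had weight $0.96$, the output for it is mostly based on $v_{\text{cat}}$. That means it now carries information from cat. This is how attention makes a token context-aware.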
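Here is the same kind of weighted sum as a sketch in code. The weights are the 0.96 / 0.01 / 0.03 split used for it throughout the article; the value vectors are hypothetical and chosen only to make the arithmetic easy to follow.

```python
weights = [0.96, 0.01, 0.03]   # attention from "it" to cat, drank, milk
V = [[10.0, 0.0],              # v_cat   (hypothetical values)
     [0.0, 10.0],              # v_drank (hypothetical values)
     [5.0, 5.0]]               # v_milk  (hypothetical values)

# o_it = sum_j a_j * v_j, computed one output dimension at a time
o_it = [sum(a * v[d] for a, v in zip(weights, V)) for d in range(len(V[0]))]
print([round(x, 2) for x in o_it])   # [9.75, 0.25]
```

The output sits almost on top of the cat vector, nudged slightly toward the other two.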
17. Full matrix shape of O = AV
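Suppose $A \in \mathbb{R}^{8 \times 8}$ and $V \in \mathbb{R}^{8 \times 64}$. Then:
$$O = AV$$
Shape: $(8 \times 8)(8 \times 64) \to (8 \times 64)$. So:
$$O \in \mathbb{R}^{8 \times 64}$$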
Meaning: 8 input tokens, 8 output token vectors, each output vector with 64 dimensions. Every token gets updated. The output for it is one row. The output for cat is another row. Each output row is a weighted mixture of the value vectors.
18. What attention actually accomplishes
Before attention, each token is mostly isolated:
- it = pronoun
- cat = animal noun
- milk = liquid noun
After attention:
- it = pronoun + information from cat
- drank = verb + information from cat and milk
- thirsty = adjective + information from it/cat
So attention turns each token into a context-aware token. That is the big idea.
The vector for it no longer just means "it." It now means something closer to "it, referring to cat in this sentence." The vector for drank may encode "drank, with subject cat and object milk." The vector for thirsty may encode "thirsty, describing the referent of it, likely cat."
Attention allows information to move between tokens.
19. Why not just average all tokens?
You might ask: why not just average all the word embeddings? Because not all tokens matter equally.
For understanding it, cat matters much more than drank or milk. A simple average would do:
$$o = \frac{1}{n} \sum_{j=1}^{n} x_j$$
That treats every token equally. Attention does:
$$o_i = \sum_{j=1}^{n} a_{ij} v_j$$
That is much smarter. The model dynamically decides what matters depending on the current token and the context.
20. Why not just pick the highest token?
Another question: why use a weighted sum instead of just choosing the best token? Because language often requires combining information from several places.
Example:
The old cat near the window drank milk because it was thirsty.
To understand it, the model may need: cat (referent), old (description), near the window (context), drank (action), thirsty (state/reason). A hard choice would lose useful information.
Attention uses soft selection: mostly cat, some old, some drank, some nearby context, little or none from irrelevant tokens. This soft blending is powerful and differentiable, which means the model can learn it through gradient descent.
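The three options from the last two sections (plain average, hard pick, soft blend) can be compared side by side. This is a sketch with hypothetical weights and value vectors; `mix` is a made-up helper name.

```python
weights = [0.70, 0.20, 0.10]                # hypothetical attention weights
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # hypothetical value vectors

def mix(ws, vectors):
    """Weighted sum of vectors, one dimension at a time."""
    return [sum(w * v[d] for w, v in zip(ws, vectors)) for d in range(len(vectors[0]))]

uniform = mix([1 / len(V)] * len(V), V)     # plain average: ignores relevance
best = max(range(len(weights)), key=weights.__getitem__)
hard = V[best]                              # hard pick: keeps one token only
soft = mix(weights, V)                      # attention: blends by relevance

print(uniform, hard, soft)
```

The hard pick throws away the 30% of weight sitting on the other tokens, and the index-picking step has no useful gradient. The soft blend keeps that information and stays differentiable.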
21. How Q, K, V relate to the pronoun example
In the sentence:
The cat drank milk because it was thirsty.
When processing it:
- $q_{\text{it}}$ acts like: "I am looking for what this pronoun refers to."
- $k_{\text{cat}}$ might say: "I am a living noun, singular, earlier in the sentence."
- $k_{\text{milk}}$ might say: "I am a liquid noun, singular, earlier in the sentence."
Then dot products produce scores:
$$q_{\text{it}} \cdot k_{\text{cat}}, \quad q_{\text{it}} \cdot k_{\text{milk}}$$
Since cat is a better match, it gets a higher score. Softmax turns those scores into weights: cat: 0.96, milk: 0.03. Then:
$$o_{\text{it}} \approx 0.96\, v_{\text{cat}} + 0.03\, v_{\text{milk}} + \dots$$
So the output representation for it becomes mostly cat-informed. That is how attention can help resolve references.
22. Q and K are routing, V is payload
This sentence is worth remembering:
Q and K decide routing. V carries payload.
Routing means deciding where information should come from. Payload means the actual information being moved.
- $Q$: What am I looking for?
- $K$: Am I relevant to that search?
- $V$: Here is the information I provide if selected.
So $QK^\top$ decides routing. $AV$ moves payload.
Analogy: $Q$ = search query, $K$ = search index, $V$ = search result content. You do not return the index; you use the index to find the content. Same with attention. You do not use $K$ as the final content. You use $V$.
23. The full attention formula again
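Now the full formula should feel less scary:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
Break it into parts:
- Dot product, $QK^\top$: compares every query with every key.
- Scaling, $QK^\top / \sqrt{d_k}$: keeps scores numerically stable.
- Normalization, $\text{softmax}(\cdot)$: turns scores into attention weights. Call this $A$.
- Weighted sum, $AV$: uses attention weights to mix value vectors. Call this $O$.
So $O = AV$ where $A = \text{softmax}\left(QK^\top / \sqrt{d_k}\right)$. Therefore:
$$O = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
That is the whole attention operation.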
24. Full pipeline from sentence to attention output
Here is the entire process:
Step 1: Tokenize sentence: The, cat, drank, milk, because, it, was, thirsty.
Step 2: Convert tokens to embeddings: $X$. Each token becomes a vector.
Step 3: Add positional encoding: $Z = X + P$. Now each vector knows token meaning and position.
Step 4: Project into Q, K, V:
$$Q = ZW_Q, \quad K = ZW_K, \quad V = ZW_V$$
Now each token has a query, key, and value version.
Step 5: Compute similarity scores: $QK^\top$. Each token compares itself to every token.
Step 6: Scale scores: $QK^\top / \sqrt{d_k}$. Prevents softmax from becoming too sharp.
Step 7: Apply softmax: $A = \text{softmax}\left(QK^\top / \sqrt{d_k}\right)$. Scores become attention weights.
Step 8: Weighted sum of values: $O = AV$. Each token pulls information from other tokens.
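Steps 3 through 8 fit in a short plain-Python sketch. Tokenization and embedding lookup (steps 1 and 2) are assumed to have happened already, and the inputs are tiny hypothetical matrices with identity projections so the numbers stay readable. Real models use learned weights and a tensor library.

```python
import math

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(M):
    return [list(col) for col in zip(*M)]

def softmax_rows(M):
    """Apply softmax to each row independently."""
    out = []
    for row in M:
        peak = max(row)
        exps = [math.exp(s - peak) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention(X, P, W_Q, W_K, W_V):
    """Single-head attention following steps 3 to 8 above."""
    Z = [[x + p for x, p in zip(xr, pr)] for xr, pr in zip(X, P)]   # step 3
    Q, K, V = matmul(Z, W_Q), matmul(Z, W_K), matmul(Z, W_V)        # step 4
    scores = matmul(Q, transpose(K))                                # step 5
    d_k = len(K[0])
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]  # step 6
    A = softmax_rows(scaled)                                        # step 7
    O = matmul(A, V)                                                # step 8
    return A, O

# Hypothetical inputs: 3 tokens, 2 dimensions, identity projections.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
P = [[0.0, 0.1], [0.1, 0.0], [0.1, 0.1]]
I2 = [[1.0, 0.0], [0.0, 1.0]]
A, O = attention(X, P, I2, I2, I2)
for row in A:
    print([round(a, 3) for a in row])
```

Each printed row is one token's attention distribution over all three tokens, and each row of `O` is the matching context-aware output vector.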
25. Multi-head attention
Real Transformers usually do not use just one attention operation. They use multiple attention heads.
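Each head has its own $W_Q^{(h)}$, $W_K^{(h)}$, $W_V^{(h)}$, so each head learns a different way to attend. One head might track pronoun → noun. Another might track verb → subject. Another might track verb → object. Another might track adjective → noun.
Each head computes:
$$\text{head}_h = \text{softmax}\left(\frac{Q_h K_h^\top}{\sqrt{d_k}}\right)V_h, \quad Q_h = ZW_Q^{(h)},\ K_h = ZW_K^{(h)},\ V_h = ZW_V^{(h)}$$
Then the outputs are concatenated:
$$\text{MultiHead}(Z) = \text{Concat}(\text{head}_1, \dots, \text{head}_H)\,W_O$$
where $W_O$ is a learned output projection.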
This lets the model capture different relationships in parallel.
A single attention head is one lens. Multi-head attention is a committee of lenses.
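A minimal sketch of the committee idea: run several heads on the same input and glue their outputs together row by row. The two heads below use hypothetical projections that each keep a single input dimension, and the final learned output projection that real models apply after concatenation is left out to keep this short.

```python
import math

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax_rows(M):
    """Apply softmax to each row independently."""
    out = []
    for row in M:
        peak = max(row)
        exps = [math.exp(s - peak) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def head(Z, W_Q, W_K, W_V):
    """One attention head with its own projection matrices."""
    Q, K, V = matmul(Z, W_Q), matmul(Z, W_K), matmul(Z, W_V)
    d_k = len(K[0])
    scores = [[sum(q * k for q, k in zip(qr, kr)) / math.sqrt(d_k) for kr in K] for qr in Q]
    return matmul(softmax_rows(scores), V)

def multi_head(Z, heads):
    """Run every head on the same input and concatenate the outputs row by row."""
    outputs = [head(Z, *weights) for weights in heads]
    return [sum((out[i] for out in outputs), []) for i in range(len(Z))]

Z = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
first = ([[1.0], [0.0]],) * 3    # W_Q, W_K, W_V that keep only dimension 0
second = ([[0.0], [1.0]],) * 3   # W_Q, W_K, W_V that keep only dimension 1
O = multi_head(Z, [first, second])
print(O)
```

Each head produces a 1-dimensional output per token here, so concatenating two heads gives back 2 dimensions per token.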
26. Attention variants: MHA, MQA, GQA, MLA
So far we've described the original multi-head attention from the "Attention Is All You Need" paper. That works beautifully for training, but for inference it has a problem: the KV cache.
The KV cache problem
During autoregressive decoding, each new token has to attend to every previous token. Rather than recomputing $K$ and $V$ for every previous token on every step, we cache them. That cache is the KV cache.
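KV cache size per token, per layer:
$$2 \cdot n_{\text{heads}} \cdot d_{\text{head}} \cdot \text{bytes per value}$$
The factor of 2 is because we store both $K$ and $V$. Across all layers:
$$2 \cdot n_{\text{layers}} \cdot n_{\text{heads}} \cdot d_{\text{head}} \cdot \text{bytes per value}$$
For a model at Llama-2-70B scale with $n_{\text{layers}} = 80$, $n_{\text{heads}} = 64$, $d_{\text{head}} = 128$, full MHA at 2 bytes per value (fp16) comes to about 2.6 MB per token of context. At 32K context, you're holding about 80 GiB of KV cache per request. This is the dominant memory cost of inference at long context.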
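The per-token arithmetic is simple enough to sketch as a function. It assumes 2 bytes per cached value (fp16 or bf16), which is what the 2.6 MB figure implies; the function and parameter names are made up for illustration.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, d_head, bytes_per_value=2):
    """Bytes of K and V stored per token; 2 bytes per value assumes fp16/bf16."""
    # the leading 2 counts one K vector and one V vector per head, per layer
    return 2 * n_layers * n_kv_heads * d_head * bytes_per_value

per_token = kv_cache_bytes_per_token(n_layers=80, n_kv_heads=64, d_head=128)
print(per_token)                        # 2621440 bytes, about 2.6 MB
print(per_token * 32 * 1024 / 2**30)    # 80.0 GiB at a 32K-token context
```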
The variants below exist to shrink this number.
MHA — Multi-Head Attention (original)
Every head has its own $Q$, $K$, $V$. Maximum expressivity, maximum cache.
MQA — Multi-Query Attention
One shared $K$ and $V$ across all heads. Each head still has its own $Q$, so queries differ, but all heads read from the same key/value pool.
That's an $n_{\text{heads}}\times$ reduction. For Llama-2-70B's numbers, the per-token cache drops from ~2.6 MB to ~40 KB. Massive. But quality takes a real hit because the model loses head-level diversity in what it can "look at."
GQA — Grouped-Query Attention
The compromise. Group heads together; each group shares one $K$ and $V$. If you have $n_{\text{heads}} = 64$ heads and $n_{\text{groups}} = 8$ groups, you have 8 KV pairs instead of 64.
Llama-2-70B uses GQA with 8 groups: ~330 KB per token (2.6 MB / 8). Roughly $8\times$ smaller than MHA, with quality very close to MHA. This is what many production models use today (Llama, Mistral, Qwen, etc.).
MLA — Multi-Head Latent Attention (DeepSeek)
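The newest of the four, introduced by DeepSeek-V2. Rather than shrinking $K$ and $V$ directly, cache a low-rank latent and reconstruct $K$ and $V$ on the fly.
The trick:
$$c = Z\,W_{DKV}$$
Then at attention time:
$$K = c\,W_{UK}, \quad V = c\,W_{UV}$$
Only $c$ is cached, not $K$ and $V$.
$$n_{\text{layers}} \cdot d_c \cdot \text{bytes per value}$$
where $d_c$ is the latent dimension (e.g. 576 vs $2 \cdot n_{\text{heads}} \cdot d_{\text{head}}$ for the same model). The result: a cache far smaller than MHA's, though not quite as small as MQA's, and quality competitive with or better than MHA. The reconstruction matrices can be absorbed into the query/output projections at inference time, so keys and values do not have to be computed explicitly.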
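To make the "cache the latent, rebuild on demand" idea concrete, here is a toy sketch. The sizes and matrices are hypothetical (4-dimensional token vectors squeezed through a 2-dimensional latent), and it leaves out details of the real DeepSeek design such as positional handling and weight absorption.

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# Hypothetical sizes: 4-dim token vectors, 2-dim latent, 4-dim keys and values.
W_down = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]   # 4 -> 2
W_up_k = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]       # 2 -> 4
W_up_v = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]       # 2 -> 4

Z = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]]   # two tokens already seen

latent_cache = matmul(Z, W_down)     # this is all that gets stored
K = matmul(latent_cache, W_up_k)     # rebuilt when attention needs keys
V = matmul(latent_cache, W_up_v)     # rebuilt when attention needs values

stored = sum(len(row) for row in latent_cache)
full = sum(len(row) for row in K) + sum(len(row) for row in V)
print(stored, full)   # 4 numbers cached instead of 16
```

The saving comes entirely from the latent being narrower than K and V combined. Whether quality holds up depends on how well the learned low-rank projections preserve what attention needs.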
Side-by-side
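| Variant | What's shared | KV cache per token (values) | Relative size | Quality |
|---|---|---|---|---|
| MHA | nothing | $2 \cdot n_{\text{layers}} \cdot n_{\text{heads}} \cdot d_{\text{head}}$ | $1\times$ (baseline) | Best, but matched by others |
| MQA | $K$, $V$ across all heads | $2 \cdot n_{\text{layers}} \cdot d_{\text{head}}$ | $1 / n_{\text{heads}}$ | Noticeable drop |
| GQA | $K$, $V$ within groups | $2 \cdot n_{\text{layers}} \cdot n_{\text{groups}} \cdot d_{\text{head}}$ | $n_{\text{groups}} / n_{\text{heads}}$ | Near MHA |
| MLA | latent compressed $K$, $V$ | $n_{\text{layers}} \cdot d_c$ | Close to MQA in practice | Matches or beats MHA |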
For a concrete model:
| Variant | KV cache per token (Llama-70B-scale) |
|---|---|
| MHA | ~2.6 MB |
| GQA (8 groups) | ~330 KB |
| MQA | ~40 KB |
| MLA (DeepSeek-V2 scale) | ~70 KB |
MLA isn't quite as small as MQA in absolute bytes — but the quality jump puts it on a different curve entirely. MQA gets you small cache and worse quality; MLA gets you small cache and good quality.
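The per-token figures for all four variants can be recomputed from the shapes. This sketch assumes 2 bytes per value throughout. The MHA, GQA, and MQA rows use the Llama-2-70B-scale shapes from earlier; the MLA row assumes DeepSeek-V2-scale values of 60 layers and a cached width of 576, which are not Llama shapes.

```python
def kv_bytes(n_layers, values_per_layer, bytes_per_value=2):
    """Cache bytes per token, given how many values each layer stores."""
    return n_layers * values_per_layer * bytes_per_value

n_layers, n_heads, n_groups, d_head = 80, 64, 8, 128   # Llama-2-70B-scale shapes

cache = {
    "MHA": kv_bytes(n_layers, 2 * n_heads * d_head),    # K and V for every head
    "GQA": kv_bytes(n_layers, 2 * n_groups * d_head),   # K and V for every group
    "MQA": kv_bytes(n_layers, 2 * 1 * d_head),          # one shared K and V
    # MLA caches one latent per layer; 60 layers and width 576 are assumed
    # DeepSeek-V2-scale values.
    "MLA": kv_bytes(60, 576),
}

for name, size in cache.items():
    print(f"{name}: {size / 1024:.0f} KiB")
```

Note the GQA row comes out at one eighth of MHA, as the group count implies.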
What changed and what stayed the same
In all four variants, the core operation is still D-N-S:
$$O = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
What changes is how $K$ and $V$ are produced and stored:
- MHA: stored directly, one per head.
- MQA: stored directly, one set shared across heads.
- GQA: stored directly, one set per group.
- MLA: stored as a low-rank latent, reconstructed on the fly.
The DNS recipe doesn't change. The plumbing around it does — and that plumbing is what determines whether you can serve a 128K-context request without melting a GPU.
27. Important caveat: attention is not literally English reasoning
When we say "the query for it asks which noun it refers to," that is a human-friendly interpretation. Inside the model, there is no English sentence saying "find the referent of this pronoun." There are only vectors and learned weights.
But through training, the model learns vector patterns that often behave like this. So the language we use is metaphorical but useful.
A more precise version would be:
The learned query representation for it tends to assign higher compatibility scores to key representations of tokens that are useful for predicting or representing the pronoun's role in context.
That sentence is more accurate, but also a tiny academic sleep dart. The intuitive explanation is better for learning.
28. The core intuition in one story
Imagine each token is a person in a room. Each person has three cards:
- Query card: What I am looking for.
- Key card: What kind of information I have.
- Value card: The actual information I can share.
The token it walks into the room and checks everyone's key card. It compares its query card with each key card. It sees:
- cat: strong match
- milk: weak match
- drank: very weak match
Then it assigns weights: cat: 0.96, milk: 0.03, drank: 0.01.
Then it collects information from their value cards:
$$o_{\text{it}} = 0.96\, v_{\text{cat}} + 0.03\, v_{\text{milk}} + 0.01\, v_{\text{drank}}$$
Now it has a new representation that mostly contains information from cat.
That is attention.
29. Final cheat sheet
| Concept | Description | Notation |
|---|---|---|
| Embedding | A vector representing a token | $x_i$ |
| Embedding matrix | All token vectors stacked | $X$ |
| Positional encoding | Adds word-order information | $Z = X + P$ |
| Projection | Learned transformation into another vector space | $W_Q$, $W_K$, $W_V$ |
| Query | What this token is looking for | $Q = ZW_Q$ |
| Key | How this token can be found | $K = ZW_K$ |
| Value | What this token provides if selected | $V = ZW_V$ |
| Dot product scores | Compare queries with keys | $QK^\top$ |
| Scaling | Keep scores stable | $QK^\top / \sqrt{d_k}$ |
| Softmax | Turn scores into weights | $A = \text{softmax}\left(QK^\top / \sqrt{d_k}\right)$ |
| Weighted sum | Mix values using attention weights | $O = AV$ |
| Full attention | The whole operation | $\text{softmax}\left(QK^\top / \sqrt{d_k}\right)V$ |
30. Final mental model
A Transformer begins with token embeddings. It adds position so word order is known. It creates Q, K, and V so each token can search, be searched, and provide information. It uses dot products to score relevance. It uses softmax to normalize those scores into attention weights. It uses $O = AV$ to mix information from value vectors.
The output is a new representation for every token, now informed by the relevant tokens around it.
The cleanest version:
Attention lets each token ask: "Who matters to me?" Then it pulls information from those tokens and updates itself.
Or using the mnemonic:
DNS: Dot product, Normalize, Sum.
And the killer one-liner:
Q and K decide where to look. V is what gets copied. Positional encoding tells the model where every word is standing.
