Table of Contents
- TL;DR
- Act I: the bottleneck that started it all
- Act II: simpler scores, same recipe
- Act III: attention is all you need
- Interlude: the two bills come due
- Route 1: compute fewer scores
- Route 2: change the math
- Route 3: keep it exact, stop wasting memory traffic
- Route 4: shrink what generation remembers
- Where that leaves us
- Sources and further reading
Attention started life in 2014 as a modest patch for machine translation: a way to stop a recurrent network from cramming an entire sentence into one fixed-size vector. Within three years it had eaten the whole architecture. Within ten, the industry’s most interesting design arguments were no longer about whether to use attention but about which of its many descendants to use, because the original mechanism carries costs that grow brutally with context length. This post tells that story as an evolutionary one: what each form of attention is, the precise shortcoming that forced the next variant into existence, and the math, code, and pictures that make each step click.
TL;DR
- Attention is a differentiable lookup: score how relevant every stored item is to a query, softmax the scores into weights, and return the weighted average of the items. Every variant in this post is that recipe with a different answer to “which items?” and “scored how?”.
- 2014 to 2017 built the mechanism. Additive attention fixed the seq2seq bottleneck, dot-product attention made it cheap on matrix hardware, and the Transformer made it the entire model via self-attention and multiple heads.
- The Transformer’s bargain: perfect parallelism during training, paid for with a bill that grows quadratically in sequence length at training time, plus a per-token cost at inference that grows with everything generated so far.
- Everything since 2017 is four escape routes from that bill: compute fewer scores (sparse and sliding-window attention, now learned sparsity like NSA and DSA), change the math (linear attention and its gated hybrid revival), keep the math exact but stop wasting memory traffic (FlashAttention), and shrink what inference must remember (MQA, GQA, MLA, covered in depth in the KV cache post).
- Nothing is free. Each route trades a little quality, generality, or implementation simplicity for its savings, which is why frontier models in 2026 mix several of them in one stack rather than crowning a single winner.
Act I: the bottleneck that started it all
In 2014, the state of the art for machine translation was the sequence-to-sequence recurrent network: an encoder RNN reads the source sentence and summarizes it into its final hidden state, and a decoder RNN unrolls the translation from that single vector. The design has an obvious pinch point. Whether the sentence is five words or fifty, everything the decoder will ever know about it must survive the squeeze through one fixed-size vector. Cho et al. (2014) measured exactly the failure you would predict: translation quality falls off sharply as sentences grow longer than the ones seen in training.
Bahdanau, Cho and Bengio (2014) removed the pinch point with a mechanism they called an alignment model, and the framing they chose is still the cleanest way to understand every variant that followed. Instead of one context vector for the whole sentence, the decoder gets a fresh context vector at every step, built in three moves. Writing \(s_{t-1}\) for the decoder’s state going into step \(t\) (the state left by the previous step) and \(h_1, \dots, h_n\) for the encoder’s per-word states:
\[\begin{aligned} e_{t,i} &= v_a^{\top} \tanh\!\big(W_a\, \textcolor{#2a78d6}{s_{t-1}} + U_a\, \textcolor{#c2410c}{h_i}\big) && \textcolor{#7b8794}{\small\text{1. score: a tiny MLP grades each source word}} \\[4pt] \alpha_{t,i} &= \frac{\exp(e_{t,i})}{\sum_{j=1}^{n} \exp(e_{t,j})} && \textcolor{#7b8794}{\small\text{2. normalize: softmax turns scores into weights}} \\[4pt] \textcolor{#6d5bc7}{c_t} &= \sum_{i=1}^{n} \alpha_{t,i}\, \textcolor{#199e70}{h_i} && \textcolor{#7b8794}{\small\text{3. mix: blend the states, heaviest weights dominate}} \end{aligned}\]Everything with an \(a\) subscript (\(v_a\), \(W_a\), \(U_a\)) is a learned parameter: together they form a one-hidden-layer network whose only job is to emit a single relevance number per source word, trained jointly with everything else. The three steps then read top to bottom as one pipeline: raw scores \(e\), weights \(\alpha\) that sum to one, and a context vector \(c_t\) that feeds the decoder’s next prediction.
The color coding here is deliberate and recurs through the post: blue marks the thing doing the asking (the query role), orange marks what gets scored against it (the key role), green marks the content that gets blended (the value role), and purple marks the output. In Bahdanau’s version the same encoder state \(h_i\) plays both the orange and green roles; separating them into distinct learned projections comes later, and is one of the Transformer’s quiet improvements.
Because every step is differentiable (a softmax rather than a discrete choice), the alignment is learned end-to-end with ordinary backpropagation, no alignment supervision needed. The learned weights turned out to be interpretable for free: heatmaps of \(\alpha\) recover word alignments, including the reorderings between English and French, purely as a side effect of training on translation.
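The three moves can be sketched in a few lines of PyTorch. This is a minimal illustration for a single decoder step, not the paper's implementation: the function name, the dimension `d_att` of the scoring network, and the random toy tensors are all assumptions made for the example.

```python
import torch

def additive_attention(s_prev, H, W_a, U_a, v_a):
    """One decoder step of additive (Bahdanau-style) attention.

    s_prev: [d_dec]     decoder state s_{t-1} (query role)
    H:      [n, d_enc]  encoder states h_1..h_n (key and value roles)
    W_a: [d_att, d_dec], U_a: [d_att, d_enc], v_a: [d_att]  (learned in a real model)
    """
    # 1. score: e_i = v_a^T tanh(W_a s_prev + U_a h_i), one number per source word
    e = torch.tanh(W_a @ s_prev + H @ U_a.T) @ v_a   # [n]
    # 2. normalize: softmax turns scores into weights that sum to one
    alpha = torch.softmax(e, dim=0)                  # [n]
    # 3. mix: weighted average of the encoder states
    c = alpha @ H                                    # [d_enc]
    return c, alpha

# Toy sizes, chosen only for illustration
torch.manual_seed(0)
n, d_enc, d_dec, d_att = 5, 8, 6, 4
H = torch.randn(n, d_enc)
s_prev = torch.randn(d_dec)
W_a, U_a, v_a = torch.randn(d_att, d_dec), torch.randn(d_att, d_enc), torch.randn(d_att)
context, weights = additive_attention(s_prev, H, W_a, U_a, v_a)
```

Here `weights` plays the role of \(\alpha_{t,\cdot}\) and `context` the role of \(c_t\); plotting `weights` for each decoder step is what produces the alignment heatmaps.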
Soft vs hard attention. What Bahdanau built is soft attention: every position gets some weight, and gradients flow through all of them. The alternative, hard attention, samples one position to look at, which is cheaper at inference but non-differentiable, so it needs reinforcement-learning-style estimators to train. Xu et al. (2015) compared both for image captioning in the “Show, Attend and Tell” paper. Soft attention won the ecosystem almost entirely because it trains easily, and every variant in the rest of this post is soft. Ironically, the newest learned-sparsity methods in Route 1 are a partial return of hard attention’s spirit: select a subset, ignore the rest.
Act II: simpler scores, same recipe
The next move was not a new capability but a simplification. Luong et al. (2015) asked how much of Bahdanau’s scoring machinery was actually necessary and tried three score functions:
\[\text{score}(\textcolor{#2a78d6}{s_t}, \textcolor{#c2410c}{h_i}) = \begin{cases} \textcolor{#2a78d6}{s_t^{\top}}\, \textcolor{#c2410c}{h_i} & \textcolor{#7b8794}{\small\text{dot: no parameters at all}} \\[3pt] \textcolor{#2a78d6}{s_t^{\top}}\, W_a\, \textcolor{#c2410c}{h_i} & \textcolor{#7b8794}{\small\text{general: one learned matrix in between}} \\[3pt] v_a^{\top} \tanh\!\big(W_a [\textcolor{#2a78d6}{s_t}; \textcolor{#c2410c}{h_i}]\big) & \textcolor{#7b8794}{\small\text{concat: Bahdanau's MLP, restated}} \end{cases}\]The finding that mattered: the dot product, the option with zero parameters, worked about as well as the learned MLP. That is a hardware result as much as a modeling one. Scoring one query against all \(n\) keys becomes a single matrix-vector product, and scoring all queries against all keys becomes one matrix-matrix multiply, exactly the operation GPUs are built to saturate. Additive attention needs the same number of comparisons but computes each one through a small network, which does not batch into one clean multiply. When the Transformer arrived two years later, it kept the dot product without serious debate; multiplicative attention had won on speed at equal quality.
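As a rough sketch of the three score functions, assuming the decoder and encoder states share one dimension `d` (the function names and toy sizes are illustrative, not from the paper):

```python
import torch

def score_dot(s_t, H):
    """dot: s_t^T h_i for every i. s_t: [d], H: [n, d] -> [n]."""
    return H @ s_t

def score_general(s_t, H, W_a):
    """general: s_t^T W_a h_i. W_a: [d, d]."""
    return H @ (W_a.T @ s_t)

def score_concat(s_t, H, W_a, v_a):
    """concat: v_a^T tanh(W_a [s_t; h_i]). W_a: [d_att, 2d], v_a: [d_att]."""
    pairs = torch.cat([s_t.expand(H.shape[0], -1), H], dim=-1)  # [n, 2d]
    return torch.tanh(pairs @ W_a.T) @ v_a

torch.manual_seed(0)
n, d, d_att = 6, 8, 5
H = torch.randn(n, d)   # encoder states (keys)
S = torch.randn(3, d)   # three decoder states (queries)

dot = score_dot(S[0], H)
general = score_general(S[0], H, torch.randn(d, d))
concat = score_concat(S[0], H, torch.randn(d_att, 2 * d), torch.randn(d_att))

# The hardware argument in one line: all queries against all keys is one matmul
all_dot = S @ H.T       # [3, n]
```

Only the dot variant collapses into the single `S @ H.T` multiply without any extra parameters, which is the property the Transformer later inherited.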
Luong’s paper also planted a seed that would take eight years to matter: alongside global attention over the whole source sentence, it proposed local attention that looks only at a small window around a predicted position, to keep the cost of attending bounded. That idea, restricting which positions are visible rather than changing the scoring, is the direct ancestor of the sliding-window attention in Mistral and Gemma.
Act III: attention is all you need
Up to 2017, attention was an accessory bolted onto a recurrent network, and the recurrence was the problem. An RNN must process token \(t\) before token \(t+1\); training cannot parallelize along the sequence, and information between distant tokens has to survive a long chain of state updates. Vaswani et al. (2017) made the aggressive move: delete the RNN and keep only attention. Two ideas made that work.
Self-attention. Instead of a decoder attending to an encoder, every token in a sequence attends to every other token in the same sequence. Each token’s representation is rebuilt as a weighted mix of all the others, so any pair of positions is connected by a single step regardless of distance, and every position can be processed simultaneously. The sequential bottleneck of the RNN disappears; the original paper reports the big model reaching state-of-the-art translation after 3.5 days on 8 GPUs, a fraction of the training cost of the recurrent systems it beat, and that parallelism is what made the scaling era practical at all.
Distinct roles, learned separately. Each token’s vector \(x_i\) is projected three ways: a query \(\textcolor{#2a78d6}{q_i} = x_i W_Q\) (what am I looking for?), a key \(\textcolor{#c2410c}{k_i} = x_i W_K\) (what do I offer to be matched on?), and a value \(\textcolor{#199e70}{v_i} = x_i W_V\) (what content do I hand over if matched?). Bahdanau’s recipe then runs unchanged, in matrix form over the whole sequence at once:
\[\text{Attention}(\textcolor{#2a78d6}{Q}, \textcolor{#c2410c}{K}, \textcolor{#199e70}{V}) = \underbrace{\text{softmax}\!\left(\frac{\textcolor{#2a78d6}{Q}\textcolor{#c2410c}{K^{\top}}}{\sqrt{d_k}}\right)}_{n \times n \text{ weights}} \textcolor{#199e70}{V}\]The one new ingredient is the \(\sqrt{d_k}\) in the denominator, where \(d_k\) is the dimension of the query and key vectors, and it earns a short derivation because it explains a real failure mode. Assume the components of a query and a key are independent with mean 0 and variance 1, which is roughly what sensible initialization gives you:
\[\begin{aligned} \textcolor{#2a78d6}{q} \cdot \textcolor{#c2410c}{k} &= \textstyle\sum_{i=1}^{d_k} q_i k_i && \textcolor{#7b8794}{\small\text{a dot product is a sum of } d_k \text{ terms}} \\[4pt] \mathbb{E}[q_i k_i] &= \mathbb{E}[q_i]\,\mathbb{E}[k_i] = 0 && \textcolor{#7b8794}{\small\text{independent factors, each zero mean}} \\[4pt] \text{Var}(q_i k_i) &= \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2] = 1 && \textcolor{#7b8794}{\small\text{unit variances multiply}} \\[4pt] \text{Var}(\textcolor{#2a78d6}{q} \cdot \textcolor{#c2410c}{k}) &= d_k && \textcolor{#7b8794}{\small\text{independent terms: variances add up}} \\[4pt] \text{Var}\!\left(\frac{\textcolor{#2a78d6}{q} \cdot \textcolor{#c2410c}{k}}{\sqrt{d_k}}\right) &= 1 && \textcolor{#7b8794}{\small\text{dividing by } \sqrt{d_k} \text{ restores unit scale}} \end{aligned}\]Without the division, raw scores have standard deviation \(\sqrt{d_k}\) (that is 8 for a typical 64-dimensional head, and growing with head size), so the softmax sees inputs spread far apart, saturates toward a one-hot distribution, and its gradients vanish. One scalar keeps the softmax in the regime where it can actually learn. This is the kind of detail the Transformer got right that is easy to miss in the headline.
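The variance argument is easy to check empirically. The sketch below samples random query-key pairs under the same assumption as the derivation (independent components with mean 0 and variance 1) using only the standard library; the sample count and seed are arbitrary choices.

```python
import math
import random
import statistics

random.seed(0)
d_k, n_samples = 64, 4000

def random_dot(d):
    """Dot product of two vectors with i.i.d. N(0, 1) components."""
    return sum(random.gauss(0, 1) * random.gauss(0, 1) for _ in range(d))

raw = [random_dot(d_k) for _ in range(n_samples)]
scaled = [x / math.sqrt(d_k) for x in raw]

var_raw = statistics.pvariance(raw)        # expected to land near d_k
var_scaled = statistics.pvariance(scaled)  # expected to land near 1
print(f"raw variance ~ {var_raw:.1f}, scaled variance ~ {var_scaled:.2f}")
```

The exact numbers depend on the seed, but the raw variance should sit close to \(d_k\) and the scaled one close to 1, matching the last two lines of the derivation.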
Multiple heads. A single attention distribution has to average all the relationships a token cares about into one set of weights. The Transformer instead runs \(h\) smaller attentions in parallel (8 in the original), each with its own \(W_Q, W_K, W_V\) over a \(d_{\text{model}}/h\)-dimensional slice, and concatenates the results. Different heads then specialize (some track syntax, some track nearby tokens, some track long-range links), and because each head works in a lower dimension, the total compute is close to a single full-width head.
The whole mechanism is compact enough to write out. Here is a full multi-head causal self-attention module, shapes annotated:
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0, "d_model must split evenly across heads"
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):  # x: [batch, n, d_model]
        b, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # -> [batch, heads, n, d_head]: every head attends independently
        q, k, v = (t.view(b, n, self.n_heads, self.d_head).transpose(1, 2)
                   for t in (q, k, v))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)  # [b, h, n, n]
        # Build the mask on x's device so the module also runs on GPU
        mask = torch.triu(torch.ones(n, n, dtype=torch.bool, device=x.device), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))  # causal: no future
        weights = F.softmax(scores, dim=-1)  # rows sum to 1
        ctx = weights @ v  # [b, h, n, d_head]
        # Merge heads back to [batch, n, d_model] and apply the output projection
        return self.out(ctx.transpose(1, 2).reshape(b, n, d))
Two variants of this block matter for the story. With the causal mask (shown above), it is the decoder-style self-attention in every GPT descendant: position \(i\) sees only positions \(1\) through \(i\), so the model can be trained to predict every next token in parallel. Drop the mask and you have the bidirectional attention of BERT-style encoders. And when the queries come from one sequence but keys and values come from another, the same block is cross-attention, which is Bahdanau’s original setup living on inside translation models, multimodal models attending over image encodings, and diffusion models attending over prompt text.
That scores tensor is also where the trouble lives: it is \(n \times n\) per head. The Transformer traded the RNN’s sequential bottleneck for a quadratic one, and the rest of this post is about paying that bill.
Interlude: the two bills come due
For the 2017 paper’s 512-token sentences, quadratic cost was irrelevant. At 2026 context lengths, it is the design constraint. It helps to be precise about the two distinct bills, because different variants attack different ones:
- The compute bill (training and prefill). Building the score matrix costs \(O(n^2 d)\) work. At 512 tokens that is nothing; at 128k tokens the attention matrix alone has 16 billion entries per head per layer. Processing a long prompt (prefill) pays this in full.
- The memory bill (generation). Autoregressive decoding with a KV cache means every new token must read the keys and values of all previous tokens from GPU memory: cost per token grows linearly with context, and the cache itself, roughly half a megabyte per token for a Llama-2-class model, competes with the weights for GPU space. The KV cache post works through this arithmetic in detail.
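Both bills reduce to short arithmetic. The sketch below assumes a 7B-class configuration (32 layers, 32 KV heads of dimension 128, 16-bit cache values) to reproduce the half-megabyte figure; the helper names are made up for this example, and the configuration is an illustrative assumption rather than a spec quoted from the text.

```python
def score_matrix_entries(n_tokens):
    """Entries in one head's n x n score matrix (before any causal masking)."""
    return n_tokens * n_tokens

def kv_cache_bytes_per_token(n_layers, n_kv_heads, d_head, bytes_per_value=2):
    """Bytes cached per token: one key and one value vector per KV head per layer."""
    return 2 * n_layers * n_kv_heads * d_head * bytes_per_value

# Compute bill: the score matrix at two context lengths
entries_512 = score_matrix_entries(512)
entries_128k = score_matrix_entries(128_000)

# Memory bill: assumed 7B-class config with an fp16 cache
per_token = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=32, d_head=128)
cache_gib_at_128k = per_token * 128_000 / 2**30

print(f"{entries_512:,} vs {entries_128k:,} score entries per head per layer")
print(f"{per_token / 2**20:.2f} MiB per token, {cache_gib_at_128k:.1f} GiB at 128k tokens")
```

Under these assumptions a single 128k-token sequence would need a cache several times larger than the 16-bit weights of a 7B model, which is why the memory bill gets a route of its own.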
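Four families of variants emerged, each attacking the problem from a different angle, and they are complementary rather than competing: compute fewer scores, change the math, keep the math exact but cut memory traffic, and shrink what generation has to remember.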
Route 1: compute fewer scores
The most direct response to an \(n \times n\) matrix being too big is to not compute most of it. This works better than it has any right to, because trained attention matrices are empirically very sparse: most weight concentrates on a few tokens (recent ones, syntactically linked ones, and oddly, a handful of “sink” tokens at the start of the sequence).
The Sparse Transformer (Child et al., 2019) made the first systematic cut, giving each position a fixed pattern of a local window plus periodic strided positions, dropping the cost to \(O(n\sqrt{n})\) and demonstrating coherent generation on sequences of tens of thousands of steps. Longformer and BigBird (both 2020) refined the recipe for language: a sliding window over nearby tokens, plus a few global tokens that everything can see and that can see everything, plus (in BigBird’s case) random connections; BigBird also supplied the theory that such patterns retain the expressiveness of full attention. The mask gallery below is the fastest way to grasp the whole family: every panel is the same causal grid with different cells deleted.
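A few of these patterns can be written as predicates over a query row `i` and a key column `j` and rendered as text grids. These are simplified causal versions for illustration only: the window, stride, and global-token counts are arbitrary, and the published patterns differ in detail (Longformer's global tokens, for example, are bidirectional).

```python
def causal(i, j):
    """Full causal attention: every earlier position is visible."""
    return j <= i

def sliding_window(i, j, w=3):
    """Only the previous w positions, counting the current one."""
    return 0 <= i - j < w

def strided(i, j, w=3, stride=4):
    """Local window plus every stride-th earlier position."""
    return j <= i and (i - j < w or (i - j) % stride == 0)

def window_plus_global(i, j, w=3, n_global=1):
    """Local window plus the first n_global tokens, visible to every later row."""
    return j <= i and (i - j < w or j < n_global)

def render(pattern, n=8):
    """Draw the mask as text: '#' is a computed score, '.' is a deleted cell."""
    return "\n".join(
        "".join("#" if pattern(i, j) else "." for j in range(n)) for i in range(n)
    )

def cells_kept(pattern, n):
    """How many of the n x n scores the pattern actually computes."""
    return sum(pattern(i, j) for i in range(n) for j in range(n))

for pattern in (causal, sliding_window, strided, window_plus_global):
    print(pattern.__name__, cells_kept(pattern, 8))
    print(render(pattern), end="\n\n")
```

The causal grid grows quadratically with `n`, while the windowed count grows roughly as `n * w`, which is the entire saving.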
The version that reached production LLMs is the plainest one: sliding-window attention, exactly Luong’s 2015 “local attention” idea reborn, where each token sees only the previous \(w\) tokens. Mistral 7B shipped it with a 4,096-token window in 2023, noting that stacked layers widen the effective receptive field the same way stacked convolutions do: information hops window by window, layer by layer. The modern refinement is interleaving: Gemma 3 uses five local layers (window 1,024) for every one full-attention layer, after finding the mix barely moves perplexity (the standard measure of how well a model predicts held-out text) while cutting the KV cache dramatically. A small window handles the local structure of language, which is most of it, and the occasional global layer handles the rest.
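The cache saving from interleaving, and the way stacked windows widen the view, can both be estimated with simple counting. The 48-layer depth below is a hypothetical choice, the 32-layer depth used for the receptive-field estimate is an assumption about a Mistral-7B-like stack, and the receptive-field figure is a loose upper bound rather than a measured quantity.

```python
def cached_tokens(n_context, n_layers, window=None, local_per_global=0):
    """Total (layer, token) pairs held in the KV cache at a given context length.

    local_per_global = 0 means every layer uses full attention. Otherwise layers
    repeat in groups of `local_per_global` sliding-window layers followed by one
    full-attention layer.
    """
    total = 0
    for layer in range(n_layers):
        is_global = local_per_global == 0 or (layer + 1) % (local_per_global + 1) == 0
        total += n_context if is_global else min(n_context, window)
    return total

def receptive_field_bound(n_layers, window):
    """Rough upper bound on how far information can hop through stacked windows."""
    return n_layers * window

# Hypothetical 48-layer model at 128k context, with a 5:1 local-to-global mix
full = cached_tokens(128_000, n_layers=48)
mixed = cached_tokens(128_000, n_layers=48, window=1024, local_per_global=5)
ratio = full / mixed

# Assumed 32 sliding-window layers with a 4,096-token window
reach = receptive_field_bound(n_layers=32, window=4096)
print(f"cache shrinks about {ratio:.1f}x; stacked windows reach up to {reach:,} tokens")
```

With most layers capped at the window size, the cache is dominated by the few global layers, so the saving approaches the local-to-global ratio as context grows.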
The frontier of this route replaces fixed patterns with learned sparsity: let the model decide which tokens matter instead of hard-coding “recent ones”. DeepSeek’s Native Sparse Attention (2025) trains the selection end-to-end from pretraining onward, combining compressed summaries of the past, a top-\(k\) token selector, and a local window, and reports up to 11.6x faster decoding at 64k context while matching or beating full attention on benchmarks, precisely because the sparsity is trained in rather than imposed after the fact. Its production sibling, DeepSeek Sparse Attention in V3.2, has a small “lightning indexer” score all prior tokens cheaply and keeps only the top 2,048 for real attention; the efficiency gain was large enough that DeepSeek cut its long-context API prices by half on release day. A useful way to file this development: soft attention learned to make hard(ish) choices without giving up trainability.
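The select-then-attend idea can be sketched for a single query as follows. This is a toy illustration of top-\(k\) selection in general, not DeepSeek's indexer or kernel: the low-dimensional indexer projections `q_idx` and `K_idx` are hypothetical stand-ins, and every tensor is random.

```python
import math
import torch

def topk_sparse_attention(q, K, V, q_idx, K_idx, k_keep):
    """Attention for one query over only the keys a cheap indexer ranks highest.

    q: [d], K and V: [n, d]            full-size query, keys, values
    q_idx: [d_idx], K_idx: [n, d_idx]  small indexer projections (hypothetical)
    """
    index_scores = K_idx @ q_idx                                   # [n], cheap pass over all tokens
    keep = torch.topk(index_scores, k=min(k_keep, K.shape[0])).indices
    scores = (K[keep] @ q) / math.sqrt(q.shape[0])                 # real attention on survivors only
    weights = torch.softmax(scores, dim=0)
    return weights @ V[keep], keep

torch.manual_seed(0)
n, d, d_idx = 64, 16, 4
q, K, V = torch.randn(d), torch.randn(n, d), torch.randn(n, d)
q_idx, K_idx = torch.randn(d_idx), torch.randn(n, d_idx)

sparse_out, kept = topk_sparse_attention(q, K, V, q_idx, K_idx, k_keep=8)

# Sanity check: keeping every token recovers ordinary dense attention
dense_out = torch.softmax(K @ q / math.sqrt(d), dim=0) @ V
all_out, _ = topk_sparse_attention(q, K, V, q_idx, K_idx, k_keep=n)
```

In a trained model the indexer is learned so that its ranking tracks the real attention scores; with the random projections used here the selection is arbitrary, so only the shapes and the dense-limit check are meaningful.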
Route 2: change the math
The second route asks a more radical question: is the softmax itself the problem? Strip it away and something remarkable falls out of the algebra. Katharopoulos et al. (2020) generalized the attention weight to any similarity function that factors through a feature map \(\phi\), and then re-bracketed:
\[\begin{aligned} \text{out}_i &= \frac{\sum_{j} \text{sim}(\textcolor{#2a78d6}{q_i}, \textcolor{#c2410c}{k_j})\, \textcolor{#199e70}{v_j}}{\sum_{j} \text{sim}(\textcolor{#2a78d6}{q_i}, \textcolor{#c2410c}{k_j})} && \textcolor{#7b8794}{\small\text{attention, written for one query}} \\[6pt] &= \frac{\sum_{j} \phi(\textcolor{#2a78d6}{q_i})^{\top} \phi(\textcolor{#c2410c}{k_j})\, \textcolor{#199e70}{v_j}}{\sum_{j} \phi(\textcolor{#2a78d6}{q_i})^{\top} \phi(\textcolor{#c2410c}{k_j})} && \textcolor{#7b8794}{\small\text{kernel trick: similarity factorizes}} \\[6pt] &= \frac{\phi(\textcolor{#2a78d6}{q_i})^{\top} \textcolor{#6d5bc7}{S}}{\phi(\textcolor{#2a78d6}{q_i})^{\top} z} && \textcolor{#7b8794}{\small\text{re-bracket: the sums no longer depend on } i} \\[6pt] \textcolor{#6d5bc7}{S} &= \textstyle\sum_{j} \phi(\textcolor{#c2410c}{k_j})\, \textcolor{#199e70}{v_j^{\top}}, \qquad z = \textstyle\sum_{j} \phi(\textcolor{#c2410c}{k_j}) && \textcolor{#7b8794}{\small\text{one matrix and one vector, sizes independent of } n} \\[6pt] \textcolor{#6d5bc7}{S_i} &= \textcolor{#6d5bc7}{S_{i-1}} + \phi(\textcolor{#c2410c}{k_i})\, \textcolor{#199e70}{v_i^{\top}} && \textcolor{#7b8794}{\small\text{causal case: one running matrix, updated per token}} \end{aligned}\]The purple sum is the entire trick. In softmax attention, the coupling between \(q_i\) and each \(k_j\) lives inside a nonlinearity, so every query must touch every key: \(O(n^2)\) interactions, no way around it. Factor the similarity and the key-value information collapses into a single small matrix \(S\) (feature dimension by value dimension, nothing to do with sequence length) computed once and shared by all queries. Total cost becomes linear in \(n\), and the causal form in the last line reveals the deeper identity: linear attention is an RNN. 
The state \(S\) is a fixed-size running summary of everything seen so far, updated one token at a time; generation needs constant memory and constant time per token, no KV cache at all, which is why the paper reports orders-of-magnitude faster autoregressive generation on very long sequences.
The loop version makes the RNN-ness impossible to miss:
import torch

def linear_attention_decode(phi_q, phi_k, v):
    """phi_q, phi_k: [n, d_feat]; v: [n, d_v]. Causal, one token at a time."""
    S = torch.zeros(phi_k.shape[1], v.shape[1])  # the entire "cache": d_feat x d_v
    z = torch.zeros(phi_k.shape[1])  # running normalizer
    outs = []
    for t in range(phi_q.shape[0]):
        S = S + torch.outer(phi_k[t], v[t])  # fold token t into the state
        z = z + phi_k[t]
        outs.append(phi_q[t] @ S / (phi_q[t] @ z + 1e-6))  # read the state with query t
    return torch.stack(outs)
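One way to convince yourself that the re-bracketing changes nothing is to compare the recurrent loop against the same attention computed the quadratic way. The sketch below assumes the feature map \(\phi(x) = \text{elu}(x) + 1\) used by Katharopoulos et al.; the function names and toy sizes are illustrative.

```python
import torch
import torch.nn.functional as F

def feature_map(x):
    """Positive feature map phi(x) = elu(x) + 1."""
    return F.elu(x) + 1

def linear_attention_quadratic(q, k, v):
    """Causal linear attention computed through the full n x n similarity matrix."""
    sim = torch.tril(feature_map(q) @ feature_map(k).T)    # [n, n], future masked out
    return (sim @ v) / sim.sum(dim=-1, keepdim=True)

def linear_attention_recurrent(q, k, v):
    """The same quantity from a fixed-size running state, one token at a time."""
    phi_q, phi_k = feature_map(q), feature_map(k)
    S = torch.zeros(phi_k.shape[1], v.shape[1])
    z = torch.zeros(phi_k.shape[1])
    outs = []
    for t in range(q.shape[0]):
        S = S + torch.outer(phi_k[t], v[t])
        z = z + phi_k[t]
        outs.append(phi_q[t] @ S / (phi_q[t] @ z))
    return torch.stack(outs)

torch.manual_seed(0)
q, k, v = torch.randn(32, 8), torch.randn(32, 8), torch.randn(32, 8)
out_quadratic = linear_attention_quadratic(q, k, v)
out_recurrent = linear_attention_recurrent(q, k, v)
max_gap = (out_quadratic - out_recurrent).abs().max().item()
```

The two outputs should agree up to floating-point rounding, even though the recurrent version never holds more than one `d_feat x d_v` matrix and one vector.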
So why didn’t this take over in 2020? Because the fixed-size state is also the weakness. Softmax attention keeps every key and value around and can retrieve any single token verbatim; a \(d \times d\) matrix summarizing 100k tokens necessarily smears them together, and precise recall (needle-in-a-haystack retrieval, exact copying, in-context lookups) suffers. Approximating the softmax kernel more faithfully, as Performer did with random features, helps but does not close the gap. For a few years linear attention looked like a dead branch.
The revival came from treating the state update as the design surface. Modern descendants like Gated DeltaNet combine two mechanisms: learned gates that control how quickly old state decays, and the “delta rule”, which subtracts the old association for a key before writing the new one, so memory gets edited rather than only accumulated. And crucially, nobody asks linear attention to do the whole job anymore: Qwen3-Next interleaves three Gated DeltaNet layers with one full-attention layer, Kimi Linear does the same with its Kimi Delta Attention against periodic full MLA layers, and NVIDIA’s Nemotron hybrids swap in Mamba-2 state-space layers, a close mathematical cousin. The pattern mirrors Gemma’s local-global interleaving exactly: cheap layers carry the bulk of the sequence modeling, and occasional exact-attention layers provide the precise recall the cheap layers cannot. Those remaining full-attention layers increasingly get their own stabilization dressing (“gated attention”: a learned sigmoid gate on the attention output plus normalization tweaks, which Qiu et al., 2025 found improves training stability at negligible cost).
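The difference between accumulating and editing is easiest to see on a two-write toy example. The sketch below uses scalar gates `alpha` and `beta` as simplifying assumptions (real models learn them per token and per head), and the function names are made up for illustration.

```python
import torch
import torch.nn.functional as F

def additive_write(S, k, v):
    """Plain linear attention: associations only ever pile up."""
    return S + torch.outer(k, v)

def gated_delta_write(S, k, v, alpha=1.0, beta=1.0):
    """Decay the state, erase what key k currently retrieves, then write v.

    S: [d_k, d_v]. alpha is a decay gate and beta a write strength; both are
    fixed scalars here, which is a simplification of the learned gates.
    """
    S = alpha * S
    old = k @ S                                  # [d_v], current answer for key k
    return S + beta * torch.outer(k, v - old)    # replace old with v

torch.manual_seed(0)
k = F.normalize(torch.randn(4), dim=0)           # unit-norm key
v1, v2 = torch.randn(3), torch.randn(3)

# Write (k, v1), then overwrite the same key with v2
S_add = additive_write(additive_write(torch.zeros(4, 3), k, v1), k, v2)
S_delta = gated_delta_write(gated_delta_write(torch.zeros(4, 3), k, v1), k, v2)

read_add = k @ S_add      # a blend of both writes
read_delta = k @ S_delta  # only the latest write
```

With a unit-norm key and `beta = 1`, the delta-rule state returns `v2` for the overwritten key, while the additive state returns `v1 + v2`: the smearing that hurt early linear attention, in miniature.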
Route 3: keep it exact, stop wasting memory traffic
The third route is the odd one out, and my favorite piece of systems thinking in this story: it made attention dramatically cheaper without changing a single output value. Dao et al. (2022) started from a profiling observation rather than a modeling one: on a GPU, standard attention is bottlenecked not by the multiply-adds but by reading and writing the \(n \times n\) score matrix to and from high-bandwidth memory. The matrix gets materialized, written out, read back for the softmax, written out again, and read back once more for the value multiply. The arithmetic is cheap; the memory round-trips are not.
FlashAttention restructures the computation so the big matrix never exists in memory. Keys and values are processed in tiles that fit in the GPU’s small on-chip SRAM, and the softmax is computed incrementally across tiles. That requires the one piece of new math in this route, the online softmax (Milakov and Gimelshein, 2018): a softmax normally needs to see all scores before it can normalize (the max is subtracted for numerical stability, and the max might be anywhere), but you can process scores block by block if you keep a running maximum and rescale what you have accumulated whenever a new block raises it. Writing \(\tilde{m}_t\), \(\tilde{\ell}_t\) and \(\tilde{O}_t\) for the maximum, the sum of exponentials, and the output contribution computed on block \(t\) alone, the running totals fold each block in like this:
\[\begin{aligned} m^{(t)} &= \max\!\big(m^{(t-1)},\, \tilde{m}_t\big) && \textcolor{#7b8794}{\small\text{running max, updated by the new block}} \\[4pt] \ell^{(t)} &= e^{\,m^{(t-1)} - m^{(t)}}\, \ell^{(t-1)} + e^{\,\tilde{m}_t - m^{(t)}}\, \tilde{\ell}_t && \textcolor{#7b8794}{\small\text{rescale the old sum if the max moved}} \\[4pt] \textcolor{#6d5bc7}{O^{(t)}} &= e^{\,m^{(t-1)} - m^{(t)}}\, \textcolor{#6d5bc7}{O^{(t-1)}} + e^{\,\tilde{m}_t - m^{(t)}}\, \tilde{O}_t && \textcolor{#7b8794}{\small\text{the output accumulator rescales the same way}} \end{aligned}\]Every correction factor is exact, and after the last block \(T\), dividing the accumulator by the running sum, \(O^{(T)} / \ell^{(T)}\), yields the attention output. The result is mathematically identical to ordinary attention (equal up to floating-point rounding), just computed in a memory-friendly order; memory use drops from quadratic to linear in \(n\), and wall-clock speedups land in the 2-4x range per attention call (the paper’s end-to-end numbers include 3x faster GPT-2 training at 1k context). The lineage continued as pure engineering: FlashAttention-2 reorganized the parallelism for another roughly 2x, and FlashAttention-3 rebuilt it around the H100’s asynchronous units and FP8. Today this is simply what F.scaled_dot_product_attention dispatches to when it can, which is a strange kind of victory: the most widely deployed attention variant of all is invisible in every architecture diagram, because it is the same math executed in a smarter order. It also composes with everything else in this post, and it quietly weakened the case for approximate methods at moderate lengths: why accept approximation error when exact attention got this fast?
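The recurrence can be checked in plain PyTorch by visiting keys and values one tile at a time and comparing against the usual one-shot softmax. This sketch demonstrates only the math: it is non-causal, single-head, and runs as ordinary tensor ops, so it has none of the memory-traffic savings of the fused kernel. The function name and tile size are illustrative assumptions.

```python
import math
import torch

def tiled_attention(q, k, v, block=16):
    """Exact softmax attention for q: [n_q, d], k and v: [n_k, d], one K/V tile at a time."""
    n_q, d = q.shape
    m = torch.full((n_q,), float("-inf"))   # running max per query row
    l = torch.zeros(n_q)                    # running sum of exponentials
    O = torch.zeros(n_q, v.shape[1])        # running unnormalized output
    for start in range(0, k.shape[0], block):
        k_blk, v_blk = k[start:start + block], v[start:start + block]
        s = q @ k_blk.T / math.sqrt(d)      # scores for this tile only
        m_blk = s.max(dim=-1).values        # block-local max
        p = torch.exp(s - m_blk[:, None])   # block-local exponentials
        l_blk, O_blk = p.sum(dim=-1), p @ v_blk
        m_new = torch.maximum(m, m_blk)
        old_scale = torch.exp(m - m_new)        # shrinks old totals if the max moved
        blk_scale = torch.exp(m_blk - m_new)    # aligns the new block to the same max
        l = old_scale * l + blk_scale * l_blk
        O = old_scale[:, None] * O + blk_scale[:, None] * O_blk
        m = m_new
    return O / l[:, None]                   # final normalization

torch.manual_seed(0)
q, k, v = torch.randn(10, 32), torch.randn(50, 32), torch.randn(50, 32)
tiled = tiled_attention(q, k, v, block=16)
reference = torch.softmax(q @ k.T / math.sqrt(32), dim=-1) @ v
max_gap = (tiled - reference).abs().max().item()
```

At no point does the loop hold more than an `n_q x block` slice of scores, yet `max_gap` should be at rounding-error scale.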
Route 4: shrink what generation remembers
The fourth route targets the memory bill specifically: during generation, each new token must read every cached key and value, so the cache’s size sets both your memory footprint and your decode speed. The escape is to make many query heads share fewer key-value heads. Multi-query attention (Shazeer, 2019) keeps a single K/V head for all queries; grouped-query attention (Ainslie et al., 2023) interpolates, sharing each K/V head across a group of query heads (Llama 3 70B: 64 query heads, 8 KV heads, an 8x cache cut at near-full quality); DeepSeek’s multi-head latent attention (2024) instead caches a small learned compression of all heads’ K/V jointly, cutting the cache by 93% while matching full multi-head quality.
In code, the entire MHA-to-GQA evolution is two lines inside the attention block (plus K/V projections that emit fewer heads):
# MHA: n_kv_heads == n_heads. GQA: n_kv_heads < n_heads.
# Llama 3 8B uses 32 query heads and 8 KV heads (groups of 4).
k = k.repeat_interleave(n_heads // n_kv_heads, dim=1) # share each KV head
v = v.repeat_interleave(n_heads // n_kv_heads, dim=1) # across its group
# ...then attention proceeds exactly as before, but only the
# n_kv_heads original K/V tensors ever get cached.
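The effect of head sharing on cache size is a one-line formula. The layer counts and head dimension below are assumptions about 7B-class and 70B-class configurations with a 16-bit cache, used here only to illustrate the ratios; the helper name is made up for the example.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, d_head, bytes_per_value=2):
    """Keys plus values, for every KV head in every layer, for one token."""
    return 2 * n_layers * n_kv_heads * d_head * bytes_per_value

# Assumed configs: 7B-class = 32 layers, 70B-class = 80 layers, d_head = 128
mha_7b = kv_bytes_per_token(n_layers=32, n_kv_heads=32, d_head=128)   # full multi-head
mha_70b = kv_bytes_per_token(n_layers=80, n_kv_heads=64, d_head=128)  # if it kept 64 KV heads
gqa_70b = kv_bytes_per_token(n_layers=80, n_kv_heads=8, d_head=128)   # 8 shared KV heads
mqa_70b = kv_bytes_per_token(n_layers=80, n_kv_heads=1, d_head=128)   # one shared KV head

for name, size in [("MHA 7B", mha_7b), ("MHA 70B", mha_70b),
                   ("GQA 70B", gqa_70b), ("MQA 70B", mqa_70b)]:
    print(f"{name}: {size / 1024:.0f} KiB per token")
```

Under these assumptions the grouped 70B configuration caches fewer bytes per token than the multi-head 7B one, since the cache scales with KV heads rather than query heads or parameter count.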
I have kept this section short deliberately: the memory arithmetic, the MHA/GQA/MQA/MLA comparison figure, and why a 70B model can carry a smaller cache than a 7B one are all worked through in the KV cache post, which is really the fourth act of this story told from the inference side.
Where that leaves us
Laid end to end, the story has a satisfying shape: a mechanism invented to widen a model’s view (attend to everything, not one squeezed vector) spent its second decade learning to narrow it again (attend to less, remember less, move fewer bytes), without giving back the quality the widening bought.
Three observations worth carrying forward:
- The recipe never changed. Score, normalize, mix: every variant since 2014 is that loop with a different vocabulary of what may be scored. What evolved is the budget, from “everything, always” to “the right things, cheaply”.
- The winning ideas were often the boring ones. The dot product beat the learned MLP; the plain sliding window outlived most of the elaborate 2020-era sparse patterns; GQA, the least clever of the cache tricks, became the default in nearly every open model. In this lineage, simplicity plus hardware fit beats cleverness at equal quality, with FlashAttention as the purest case: zero modeling ideas, maximal impact.
- The frontier stopped picking one answer. Gemma 3 interleaves window sizes, Qwen3-Next and Kimi Linear interleave linear and full attention, DeepSeek stacks learned sparsity on top of latent compression. After a decade of “which attention?”, the 2026 answer is a portfolio: a few exact, expensive layers doing the precision work, and many cheap layers doing everything else. Whether the exact layers can eventually be evolved away entirely is one of the livelier open questions in architecture research.
Sources and further reading
- Neural Machine Translation by Jointly Learning to Align and Translate (Bahdanau, Cho, Bengio, 2014) — where attention begins; the additive alignment model and the learned word-alignment heatmaps.
- Effective Approaches to Attention-based Neural Machine Translation (Luong, Pham, Manning, 2015) — the dot/general/concat score comparison and the original local-attention idea.
- Attention Is All You Need (Vaswani et al., 2017) — self-attention, scaled dot-product, multi-head; the scaling argument in this post follows the paper’s footnote 4.
- The Big LLM Architecture Comparison: Understanding Attention Variants (Sebastian Raschka, 2026) — the survey of which variants ship in current open models; the source for the Gemma 3 interleaving details, the GQA-below-100B vs MLA-at-scale observation, and the hybrid-stack framing used in the final act.
- 15 types of attention mechanisms (Ksenia Se, 2025) — a compact taxonomy (soft/hard, self/cross, additive/multiplicative, local/global and more) that shaped this post’s framing of variants as answers to “which items, scored how?”.
- Generating Long Sequences with Sparse Transformers (Child et al., 2019), Longformer (Beltagy et al., 2020) and Big Bird (Zaheer et al., 2020) — the fixed sparse patterns in the mask gallery.
- Native Sparse Attention (Yuan et al., 2025) and DeepSeek-V3.2 (DeepSeek-AI, 2025) — learned sparsity, the lightning indexer, and the top-2048 selection numbers.
- Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention (Katharopoulos et al., 2020) — the kernel re-bracketing derivation reproduced above, and the linear-attention-is-an-RNN identity.
- FlashAttention (Dao et al., 2022) — IO-aware exact attention; the online softmax recurrence follows the paper’s Section 3 and Milakov & Gimelshein (2018).
- Fast Transformer Decoding: One Write-Head is All You Need (Shazeer, 2019), GQA (Ainslie et al., 2023) and DeepSeek-V2 (2024) — the KV-cache lineage, covered in depth in the KV cache post.
- Qwen3-Next blog post (2025) and Kimi Linear (2025) — the gated-DeltaNet hybrid stacks; Gated Attention for Large Language Models (Qiu et al., 2025) for the output-gate result.
