Glossary Entry

Linear Attention

An attention family that replaces the softmax with a factorizable similarity, collapsing the key/value history into a fixed-size state and making cost linear in sequence length.

Architecture LLMs Optimization

Also called: linear attention mechanism, kernelized attention

Seed source: Transformers are RNNs (Katharopoulos et al., 2020)

If the attention similarity factorizes through a feature map, the sums over keys and values can be re-bracketed into a single running matrix that all queries share. Cost drops from quadratic to linear in sequence length, and the causal form is literally a recurrent network: a fixed-size state updated once per token, so generation needs no growing KV cache.
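The re-bracketing can be checked numerically. The sketch below is illustrative only, assuming NumPy and the elu(x) + 1 feature map from the seed source; the function names are hypothetical, and a single unbatched head is used for brevity. It computes causal linear attention twice, once through the full T x T similarity matrix and once as a recurrence over a fixed-size state, and confirms the outputs agree.

```python
import numpy as np

def feature_map(x):
    # elu(x) + 1: strictly positive, so the normalizer below is never zero.
    return np.where(x > 0, x + 1.0, np.exp(x))

def causal_quadratic(Q, K, V):
    # Reference form: materialize all T x T similarities, then mask the future.
    scores = np.tril(feature_map(Q) @ feature_map(K).T)    # (T, T)
    return (scores @ V) / scores.sum(axis=1, keepdims=True)

def causal_recurrent(Q, K, V):
    # Re-bracketed form: one fixed-size state, updated once per token.
    T, d_k = Q.shape
    d_v = V.shape[1]
    S = np.zeros((d_k, d_v))    # running sum of phi(k_i) v_i^T
    z = np.zeros(d_k)           # running sum of phi(k_i), for normalization
    out = np.empty((T, d_v))
    for t in range(T):
        phi_k = feature_map(K[t])
        S += np.outer(phi_k, V[t])
        z += phi_k
        phi_q = feature_map(Q[t])
        out[t] = (phi_q @ S) / (phi_q @ z)
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((6, 4)) for _ in range(3))
assert np.allclose(causal_quadratic(Q, K, V), causal_recurrent(Q, K, V))
```

The recurrent version never stores more than `S` and `z`, whose sizes depend on the head dimensions and not on the number of tokens seen, which is why no growing KV cache is needed at generation time.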

The trade-off is precise recall: a fixed-size state that summarizes a long context cannot retrieve arbitrary tokens verbatim the way softmax attention can. Modern descendants such as gated DeltaNet variants add learned gates that edit the state, and current models typically interleave many linear-attention layers with occasional full-attention layers rather than using either alone.
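To make "gates that edit the state" concrete, here is a minimal sketch of one gated delta-rule step, assuming NumPy. It follows the commonly cited form of the update (decay the state, then correct what it returns for the current key); the function name is hypothetical, and `alpha` and `beta` are passed as plain scalars here, whereas in a trained model they would be produced from the input by learned projections.

```python
import numpy as np

def gated_delta_step(S, k, v, alpha, beta):
    """One state update. S: (d_k, d_v); k: unit-norm key (d_k,); v: (d_v,).

    alpha in [0, 1] decays the whole state; beta in [0, 1] is the write strength.
    """
    S = alpha * S                          # gate: forget a fraction of old memory
    v_old = k @ S                          # what the state currently returns for k
    return S + beta * np.outer(k, v - v_old)   # delta rule: move the read toward v

d_k, d_v = 4, 3
k = np.zeros(d_k)
k[0] = 1.0                                 # a unit-norm key
v1 = np.array([1.0, 2.0, 3.0])
v2 = np.array([-1.0, 0.0, 5.0])

S = np.zeros((d_k, d_v))
S = gated_delta_step(S, k, v1, alpha=1.0, beta=1.0)
S = gated_delta_step(S, k, v2, alpha=1.0, beta=1.0)
assert np.allclose(k @ S, v2)              # second write replaced the first
```

With a purely additive update the same two writes would leave `v1 + v2` stored under `k`. The delta-rule correction overwrites the old value instead, which is one way such variants try to use a fixed-size state more carefully.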