If the attention similarity factorizes through a feature map, the sums over keys and values can be re-bracketed into a single running matrix that all queries share. Cost drops from quadratic to linear in sequence length, and the causal form is literally a recurrent network: a fixed-size state updated once per token, so generation needs no growing KV cache.
The trade is precise recall: a fixed-size state summarizing a long context cannot retrieve arbitrary tokens verbatim the way softmax attention can. Modern descendants such as gated DeltaNet variants add learned gates that edit the state, and current models typically interleave many linear-attention layers with occasional full-attention layers rather than using either alone.
