During autoregressive generation, a transformer’s causal attention means past tokens’ key and value vectors never change once computed. The KV cache stores them (per head, per layer) so each new decode step only processes the newest token and attends over the stored entries. This drops per-step cost from quadratic to linear in sequence length, and the output is mathematically unchanged.
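
To make the equivalence concrete, here is a minimal sketch of a single attention head in one layer, written in plain Python. The helper names (`matvec`, `attend`, `decode_without_cache`, `decode_with_cache`), the random weights, and the toy dimensions are all assumptions made for illustration, not any library's API. The uncached loop here only re-projects the prefix's keys and values at every step; a real multi-layer model without a cache would also have to recompute attention at every earlier position, which is where the quadratic per-step cost comes from.

```python
import math
import random

def matvec(W, x):
    """Project vector x with matrix W (a list of rows)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attend(q, keys, values):
    """Softmax attention of one query over the given keys and values."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]

def decode_without_cache(tokens, Wq, Wk, Wv):
    """Every step re-projects keys and values for the whole prefix."""
    outputs = []
    for t in range(len(tokens)):
        prefix = tokens[: t + 1]  # causal: only tokens up to position t
        keys = [matvec(Wk, x) for x in prefix]
        values = [matvec(Wv, x) for x in prefix]
        outputs.append(attend(matvec(Wq, tokens[t]), keys, values))
    return outputs

def decode_with_cache(tokens, Wq, Wk, Wv):
    """Every step projects only the newest token and appends to the cache."""
    key_cache, value_cache, outputs = [], [], []
    for x in tokens:
        key_cache.append(matvec(Wk, x))
        value_cache.append(matvec(Wv, x))
        outputs.append(attend(matvec(Wq, x), key_cache, value_cache))
    return outputs

# Toy setup (hypothetical sizes): 5 tokens, model width 4, head width 3.
rng = random.Random(0)
d_model, d_head, num_tokens = 4, 3, 5
tokens = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(num_tokens)]
Wq, Wk, Wv = (
    [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(d_head)]
    for _ in range(3)
)

uncached = decode_without_cache(tokens, Wq, Wk, Wv)
cached = decode_with_cache(tokens, Wq, Wk, Wv)
max_diff = max(abs(a - b) for u, c in zip(uncached, cached) for a, b in zip(u, c))
print("largest difference between the two decoders:", max_diff)
```

Both loops feed the same keys and values to `attend` at each step, so their outputs agree; the cached version simply avoids redoing projections it has already done.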
The trade is memory: the cache grows linearly with context length and batch size, and in long-context serving it can exceed the model weights themselves. Techniques like grouped-query attention, latent compression, quantization, and paged memory management exist largely to shrink this cache or to waste less of the memory it occupies.
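
The linear growth is easy to check with back-of-the-envelope arithmetic. The sketch below assumes the usual layout of one key vector and one value vector per token, per layer, per KV head; the function name `kv_cache_bytes` and the model configuration (32 layers, 32 heads of width 128, 16-bit elements, roughly 7 billion parameters) are hypothetical values chosen for illustration, not a description of any specific model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem=2):
    """Estimate KV cache size; the leading 2 covers keys plus values."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)

GIB = 1024 ** 3

# Hypothetical configuration, 16-bit (2-byte) cache entries.
layers, heads, head_dim = 32, 32, 128
seq_len, batch_size = 32_768, 8

per_token = kv_cache_bytes(layers, heads, head_dim, seq_len=1, batch_size=1)
full_mha = kv_cache_bytes(layers, heads, head_dim, seq_len, batch_size)

# Grouped-query attention, assuming 8 KV heads shared by the 32 query heads.
full_gqa = kv_cache_bytes(layers, 8, head_dim, seq_len, batch_size)

# Assumed weight footprint: 7e9 parameters at 2 bytes each.
weight_bytes = 7_000_000_000 * 2

print(f"per token:      {per_token / 1024:.0f} KiB")
print(f"full cache:     {full_mha / GIB:.0f} GiB")
print(f"with 8 KV heads: {full_gqa / GIB:.0f} GiB")
print(f"weights:        {weight_bytes / GIB:.1f} GiB")
```

Under these assumptions each token costs 512 KiB of cache, so a batch of 8 sequences at 32,768 tokens needs 128 GiB, several times the assumed weight footprint. Cutting to 8 KV heads brings that down to 32 GiB, and halving `bytes_per_elem` (as 8-bit quantization would) halves it again.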
