Glossary Entry

KV Cache

A store of every past token's attention key and value vectors, reused at each generation step so the model never recomputes them.

Architecture LLMs Optimization

Also called: KV caching, key-value cache, key/value cache

During autoregressive generation, a transformer’s causal attention means past tokens’ key and value vectors never change once computed. The KV cache stores them (per head, per layer) so each new decode step only processes the newest token, dropping per-step cost from quadratic to linear in sequence length without changing the output at all.

The trade is memory: the cache grows linearly with context length and batch size, and in long-context serving it can exceed the model weights themselves. Techniques like grouped-query attention, latent compression, quantization, and paged memory management exist largely to keep this cache small.