Glossary Entry

Grouped-Query Attention

An attention variant where groups of query heads share a single key/value head, shrinking the KV cache with little quality loss.

Tags: Architecture, LLMs, Optimization

Also called: GQA, grouped query attention

Seed source: GQA paper (Ainslie et al., 2023)

Classic multi-head attention gives every query head its own key and value head, so the KV cache scales with the full head count. Grouped-query attention (GQA) partitions the query heads into groups, and each group shares one key/value head. It sits between multi-head attention (one K/V head per query head) and multi-query attention (one K/V head shared by all query heads).
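The head sharing described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used by any particular model. The function name `grouped_query_attention`, the tensor layout (heads, sequence, head dimension), and the causal mask are assumptions made for the example. It also assumes that consecutive query heads form a group.

```python
import numpy as np

def grouped_query_attention(q, k, v):
    """Causal attention where groups of query heads share one K/V head.

    q:    (n_q_heads, seq_len, head_dim)
    k, v: (n_kv_heads, seq_len, head_dim), with n_q_heads % n_kv_heads == 0
    """
    n_q_heads, seq_len, head_dim = q.shape
    n_kv_heads = k.shape[0]
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    group_size = n_q_heads // n_kv_heads

    # Assumed grouping: consecutive query heads share a K/V head, so
    # query head i reads K/V head i // group_size.
    k_shared = np.repeat(k, group_size, axis=0)
    v_shared = np.repeat(v, group_size, axis=0)

    # Scaled dot-product scores, one (seq_len, seq_len) matrix per query head.
    scores = q @ k_shared.transpose(0, 2, 1) / np.sqrt(head_dim)

    # Causal mask: position t may only attend to positions <= t.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    # Numerically stable softmax over the key axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v_shared

# Usage: 8 query heads sharing 2 K/V heads (groups of 4).
rng = np.random.default_rng(0)
q = rng.standard_normal((8, 5, 16))
k = rng.standard_normal((2, 5, 16))
v = rng.standard_normal((2, 5, 16))
print(grouped_query_attention(q, k, v).shape)  # (8, 5, 16)
```

Setting the number of K/V heads equal to the number of query heads recovers multi-head attention, and setting it to 1 recovers multi-query attention. Only `k` and `v` need to be cached during decoding, which is where the memory saving comes from. The `np.repeat` copy is for clarity. A production kernel would index the shared heads instead of materializing the repeats.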

The payoff is a much smaller KV cache and faster decoding at close to multi-head quality, which is why GQA became the default in model families like Llama 3, Mistral, and Qwen. Llama 3 70B, for example, runs 64 query heads against 8 key/value heads, so its KV cache is one eighth the size it would be under multi-head attention.
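To make the 8x figure concrete, the cache size can be worked out directly. The sketch below is back-of-the-envelope arithmetic. The helper `kv_cache_bytes` is hypothetical, and the layer count (80), head dimension (128), 16-bit cache values, and 8,192-token context are assumptions that do not come from this entry. Only the 64 and 8 head counts do.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1):
    """Bytes needed to cache keys and values for every layer and position."""
    # The leading 2 counts one K tensor and one V tensor per layer.
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * bytes_per_value * batch_size)

# Assumed configuration: 80 layers, head_dim 128, fp16 cache, 8,192 tokens.
mha = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128, seq_len=8192)
gqa = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=8192)

print(mha / 2**30)  # 20.0 (GiB with one K/V head per query head)
print(gqa / 2**30)  # 2.5  (GiB with 8 shared K/V heads)
print(mha // gqa)   # 8
```

The ratio depends only on the head counts (64 / 8), so the 8x reduction holds for any context length, batch size, or cache precision.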