Classic multi-head attention gives every query head its own key and value head, so the KV cache scales with the full head count. Grouped-query attention (GQA) partitions the query heads into groups that each share one key/value head, sitting between multi-head attention (one K/V per query head) and multi-query attention (one K/V for all).
The payoff is a much smaller KV cache and faster decoding at close to multi-head quality, which is why GQA became the default in model families like Llama 3, Mistral, and Qwen. Llama 3 70B, for example, runs 64 query heads against 8 key/value heads, an 8x cache reduction.
