Glossary Entry

Sliding-Window Attention

A sparse attention pattern where each token attends only to a fixed-size window of recent tokens, bounding compute and cache growth on long contexts.

Architecture LLMs Optimization

Also called: SWA, sliding window attention, local attention

Seed source: Mistral 7B (Jiang et al., 2023)

Instead of scoring every past token, each position attends only to the most recent w tokens, so per-token attention cost and KV-cache size stop growing once the context exceeds the window. Stacked layers widen the effective receptive field the way stacked convolutions do: information propagates window by window, layer by layer, so after k layers a token can draw on roughly k × w earlier positions.
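The windowing and the layer-by-layer propagation can be made concrete with a small sketch. It is illustrative only: the function names are hypothetical, and it assumes the window counts the current token (so w = 3 means the token itself plus the two before it); some implementations count the window differently.

```python
from collections import deque


def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j.

    Assumption: the window includes the current token, so position i sees
    positions i - window + 1 through i (causal, never the future).
    """
    return [[0 <= i - j < window for j in range(seq_len)] for i in range(seq_len)]


def reachable_after(mask: list[list[bool]], num_layers: int) -> list[list[bool]]:
    """Positions whose information can reach each query after stacking layers."""
    n = len(mask)
    # Before any attention layer, a position only knows about itself.
    reach = [[i == j for j in range(n)] for i in range(n)]
    for _ in range(num_layers):
        # One more layer: i can see whatever its visible neighbours k already saw.
        reach = [
            [any(mask[i][k] and reach[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)
        ]
    return reach


mask = sliding_window_mask(seq_len=8, window=3)
one_layer = [j for j, ok in enumerate(mask[5]) if ok]      # [3, 4, 5]
two_layers = [j for j, ok in enumerate(reachable_after(mask, 2)[7]) if ok]
# two_layers == [3, 4, 5, 6, 7]: the span grew from 3 to 5 positions.

# A rolling KV cache can be modelled as a fixed-length buffer:
# once it is full, appending a new entry evicts the oldest one.
kv_cache = deque(maxlen=3)
for token_position in range(10):
    kv_cache.append(token_position)
# list(kv_cache) == [7, 8, 9]: the cache never holds more than `window` entries.
```

Under this convention each extra layer extends the reach by w - 1 positions, which is where the roughly k × w figure comes from when w is large.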

Production models rarely use it everywhere. The common recipe interleaves several windowed layers with an occasional full-attention layer (Gemma 3 uses five local layers per global one), keeping most of the stack cheap while the global layers preserve long-range retrieval.
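A rough sketch of such an interleaved schedule, and of what it does to cache size, might look like the following. The function names, the 12-layer depth, the 1,024-token window, and the placement of the global layer at the end of each group are all assumptions for illustration, not any model's actual configuration.

```python
def layer_pattern(num_layers: int, local_per_global: int = 5) -> list[str]:
    """Repeat `local_per_global` windowed layers followed by one global layer.

    Assumption: the global layer closes each group; real models may place it
    elsewhere in the cycle.
    """
    period = local_per_global + 1
    return ["global" if (i + 1) % period == 0 else "local" for i in range(num_layers)]


def cached_positions(pattern: list[str], context_len: int, window: int) -> int:
    """Token positions held in the KV cache, summed over all layers."""
    return sum(
        context_len if kind == "global" else min(context_len, window)
        for kind in pattern
    )


pattern = layer_pattern(num_layers=12)            # 10 local layers, 2 global
hybrid = cached_positions(pattern, context_len=32_768, window=1_024)
full = cached_positions(["global"] * 12, context_len=32_768, window=1_024)
# hybrid == 75_776 and full == 393_216, so the hybrid stack caches
# about 19% of the positions a full-attention stack would.
```

In this toy setting the two global layers account for most of the remaining cache, which is why the local-to-global ratio is the main lever on memory at long context.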