Instead of scoring every past token, each position sees only the previous w tokens, so per-token cost and cache size stop growing once the context exceeds the window. Stacked layers widen the effective receptive field the way stacked convolutions do: information propagates window by window, layer by layer.
Production models rarely use it everywhere. The common recipe interleaves several windowed layers with an occasional full-attention layer (Gemma 3 uses five local layers per global one), keeping most of the stack cheap while the global layers preserve long-range retrieval.
