Glossary Entry

FlashAttention

An exact, IO-aware implementation of attention that avoids materializing the full score matrix, cutting memory use and wall-clock time without changing outputs.

Architecture LLMs Optimization

Also called: Flash Attention, FlashAttention-2, FlashAttention-3

Seed source: FlashAttention (Dao et al., 2022)

Standard attention implementations are bottlenecked by writing the n-by-n score matrix to GPU high-bandwidth memory (HBM) and reading it back, rather than by arithmetic. FlashAttention restructures the computation into tiles that fit in fast on-chip memory and uses an online softmax to normalize incrementally, so the full n-by-n matrix is never stored in HBM; only one tile of scores exists at a time, and the extra memory grows linearly with sequence length instead of quadratically.
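The tiling and online softmax described above can be sketched in NumPy. This is an illustrative sketch rather than the actual kernel: it tiles only the keys and values (real kernels also tile the queries and run on-chip), and the names `reference_attention`, `tiled_attention`, and `block_size` are assumptions made for this example.

```python
import numpy as np

def reference_attention(q, k, v):
    """Ordinary attention: materializes the full n-by-n score matrix."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def tiled_attention(q, k, v, block_size=4):
    """Attention over key/value tiles with an online softmax.

    Only an (n, block_size) slice of scores exists at any one time.
    """
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    running_max = np.full((n, 1), -np.inf)  # row-wise max seen so far
    running_sum = np.zeros((n, 1))          # row-wise softmax denominator so far
    acc = np.zeros((n, v.shape[-1]))        # unnormalized output so far

    for start in range(0, k.shape[0], block_size):
        k_blk = k[start:start + block_size]
        v_blk = v[start:start + block_size]
        s = (q @ k_blk.T) * scale           # scores for this tile only

        new_max = np.maximum(running_max, s.max(axis=-1, keepdims=True))
        # Rescale earlier partial results to the new max (0 on the first tile).
        correction = np.exp(running_max - new_max)
        p = np.exp(s - new_max)

        running_sum = running_sum * correction + p.sum(axis=-1, keepdims=True)
        acc = acc * correction + p @ v_blk
        running_max = new_max

    return acc / running_sum
```

For random inputs, `tiled_attention(q, k, v)` and `reference_attention(q, k, v)` should agree to within floating-point tolerance, including when the sequence length is not a multiple of `block_size`.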

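Because every rescaling step is exact, the outputs match ordinary attention up to floating-point rounding; the gains come from moving fewer bytes, not from approximating the result. Mainstream frameworks ship it as a fused attention backend (for example, PyTorch's scaled_dot_product_attention can dispatch to a FlashAttention kernel on supported GPUs and data types). Successive versions are further engineering refinements: FlashAttention-2 improves parallelism and work partitioning, and FlashAttention-3 targets features of newer GPU generations such as NVIDIA Hopper.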
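The exactness of the rescaling rests on the identity exp(x - m_old) * exp(m_old - m_new) = exp(x - m_new): a partial sum computed against an old maximum can be converted to the new maximum without revisiting the old values. A minimal pure-Python sketch on a single row of scores, with hypothetical function names chosen for this example:

```python
import math

def plain_softmax(values):
    """Two-pass softmax: needs every value available before normalizing."""
    m = max(values)
    exps = [math.exp(x - m) for x in values]
    total = sum(exps)
    return [e / total for e in exps]

def streaming_softmax(values, chunk_size=2):
    """Builds the softmax denominator one chunk at a time."""
    running_max = -math.inf
    running_sum = 0.0
    for start in range(0, len(values), chunk_size):
        chunk = values[start:start + chunk_size]
        new_max = max(running_max, max(chunk))
        # Convert the old partial sum to the new max, then add this chunk.
        running_sum = running_sum * math.exp(running_max - new_max)
        running_sum += sum(math.exp(x - new_max) for x in chunk)
        running_max = new_max
    return [math.exp(x - running_max) / running_sum for x in values]
```

Both functions should return the same probabilities up to rounding, regardless of `chunk_size` or the order in which large values arrive.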