Attention gives a model a way to weight some parts of an input sequence more heavily than others. In transformer-based models, that weighting is what preserves context and captures relationships between words that may sit far apart in the sequence.
Across the LLM posts here, attention is one of the main reasons modern language models are more capable than simpler sequence models. It is a core building block behind transformers rather than just a minor optimization.
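To make the idea concrete, here is a minimal sketch of scaled dot-product attention, the formulation transformers use, written in NumPy. The toy shapes and random inputs are illustrative assumptions, not anything from a real model: each query is compared against every key, the scores are turned into weights with a softmax, and the output is a weighted mix of the value vectors.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Core of transformer attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    # How strongly each query attends to each key, scaled for stability.
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over each row so the weights for one query sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output is a weighted average of the value vectors.
    return weights @ V

# Toy example: 3 tokens with 4-dimensional embeddings.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 4): one output vector per token
```

Because the weights are computed between every pair of positions, a token can draw on any other token in the sequence, which is why attention handles long-range relationships that simpler sequence models struggle with.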
