Glossary Entry

Self-Attention

Attention applied within a single sequence, so every token builds its new representation as a weighted mix of all the other tokens.

Architecture LLMs Representation

Also called: self attention, intra-attention

Seed source: Attention Is All You Need (Vaswani et al., 2017)

Earlier attention mechanisms connected two different sequences, such as a decoder attending over an encoder’s states. Self-attention turns the mechanism inward: each token in a sequence queries all the others, so any pair of positions is linked in a single step regardless of distance, and every position can be processed in parallel.

That parallelism is what let the transformer drop recurrence entirely and made large-scale pre-training practical. With a causal mask it becomes the decoder-style attention in GPT-family models; without the mask it is the bidirectional attention of BERT-style encoders.