Attention lets a model focus on relevant parts of the input when producing each output. It's the mechanism that allows LLMs to understand context.
How does the model know which words matter for predicting the next one?
When you read "The cat sat on the ___", you know "mat" is likely because of "cat" and "sat." Not every word matters equally. Your brain focuses on the relevant parts.
Neural networks needed a similar ability. Enter attention: a mechanism that lets the model dynamically focus on different parts of the input depending on what it's trying to do.
Before attention, sequence models had to compress everything they had read into a fixed-size summary, no matter how long the input was. With attention, the model learns where to look for each decision it makes.
The core idea: weighted combinations
Attention computes which parts of the input are relevant to each output. For every position in the sequence, it produces weights over all other positions. High weight means "pay attention here." Low weight means "ignore this."
These weights let the model combine information flexibly. When predicting a word that refers back to something earlier, attention can put high weight on that earlier word, effectively connecting them despite the distance.
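As a rough sketch of that idea, here is a toy weighted combination in Python. The words, vectors, and weights below are invented for illustration, not taken from any real model:

```python
import numpy as np

# Toy value vectors for three earlier words (made-up 4-dimensional features).
values = np.array([
    [0.9, 0.1, 0.0, 0.2],   # "cat"
    [0.1, 0.8, 0.1, 0.0],   # "sat"
    [0.0, 0.1, 0.9, 0.1],   # "on"
])

# Hypothetical attention weights for the current position: they sum to 1,
# with most of the focus on "cat" and "sat".
weights = np.array([0.6, 0.3, 0.1])

# The output blends the value vectors, dominated by the high-weight words.
output = weights @ values
print(output)  # [0.57 0.31 0.12 0.13]
```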
Self-attention: every word attends to every other
In self-attention, each position in a sequence computes attention weights over all positions (including itself). This happens in parallel for every position.
The result: a new representation where each position has gathered information from wherever it was relevant. A pronoun's representation now includes information about its referent. A verb's representation includes information about its subject.
This is why LLMs can handle long-range dependencies. Attention creates direct pathways between any two positions, regardless of distance.
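A minimal sketch of the shapes involved, using random vectors in place of learned representations and skipping the query/key/value projections described later:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, dim = 5, 8                  # 5 tokens, 8-dimensional vectors (toy sizes)
x = rng.normal(size=(seq_len, dim))  # one vector per position

# Similarity between every pair of positions: a seq_len x seq_len grid.
scores = x @ x.T

# Softmax each row so every position's weights over all positions sum to 1.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Every position's new representation draws directly on every other position.
new_x = weights @ x
print(weights.shape, new_x.shape)    # (5, 5) (5, 8)
```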
Multi-head attention: looking at many things at once
A single attention mechanism can only focus on one pattern at a time. But language has many simultaneous relationships: syntax, semantics, coreference, style.
Multi-head attention runs several attention mechanisms in parallel, each with its own learned projections for queries, keys, and values. One head might track grammatical agreement. Another might track semantic relatedness. Another might track position.
The outputs of all heads are combined, giving the model a rich, multi-faceted view of relationships in the text.
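A compact sketch of that idea in Python, assuming toy sizes and random matrices where a real model would use learned projection weights:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 6, 16, 4
d_head = d_model // n_heads              # each head works in a smaller subspace
x = rng.normal(size=(seq_len, d_model))

head_outputs = []
for _ in range(n_heads):
    # Each head has its own projections, so it can learn to track a different relationship.
    wq, wk, wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    q, k, v = x @ wq, x @ wk, x @ wv
    weights = softmax(q @ k.T / np.sqrt(d_head))
    head_outputs.append(weights @ v)

# Concatenate the heads and project them back to the model dimension.
w_out = rng.normal(size=(d_model, d_model))
output = np.concatenate(head_outputs, axis=-1) @ w_out
print(output.shape)                      # (6, 16)
```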
The math behind attention
For those who want the formulas:
Given query Q, key K, and value V matrices:
Attention(Q, K, V) = softmax(QK^T / √d) V
QK^T computes similarity scores between all queries and keys
Division by √d, the square root of the key/query dimension, prevents the scores from growing too large
Softmax converts scores to weights that sum to 1
Multiplying by V produces the weighted combination
This is computed for each attention head, then heads are concatenated and projected.
The elegance: it's all matrix multiplication, which GPUs excel at computing in parallel.
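A near-literal translation of the formula into Python, with small random matrices standing in for learned Q, K, and V:

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d)) V."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                             # similarity between queries and keys
    scores = scores - scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)   # each row sums to 1
    return weights @ v                                        # weighted combination of values

rng = np.random.default_rng(0)
seq_len, d = 4, 8
q, k, v = (rng.normal(size=(seq_len, d)) for _ in range(3))
print(attention(q, k, v).shape)                               # (4, 8)
```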
Why attention was revolutionary
Before the Transformer (2017), sequence models processed text step-by-step. To connect the first word to the hundredth, information had to flow through 99 intermediate steps. Information got diluted or lost.
Attention creates direct connections. The hundredth word can attend directly to the first. No intermediate steps, no dilution. This is why Transformers handle long contexts so much better than their predecessors.
It also parallelizes perfectly. All attention computations for all positions can happen simultaneously. Previous architectures had to process sequentially. This made Transformers dramatically faster to train.
[Interactive widget: Attention Pattern Visualizer, showing which words attend to which other words in real sentences]
Attention has costs
Computing attention between every pair of positions means cost grows quadratically with sequence length. Double the context, quadruple the computation.
This is why context windows have limits. A million-token context means computing attention between every pair of a million positions: roughly a trillion pairwise scores per layer. Researchers work on efficient attention variants (sparse attention, linear attention) to reduce this cost.
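A quick back-of-the-envelope check of how the number of query-key pairs grows with context length:

```python
# Number of query-key pairs scored per attention head, per layer.
for n in (1_000, 10_000, 100_000, 1_000_000):
    print(f"{n:>9,} tokens -> {n * n:>20,} pairs")
# 1,000,000 tokens -> 1,000,000,000,000 pairs (a trillion)
```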
For now, the quadratic cost is why large context windows require disproportionately more compute.