The Transformer is the neural network architecture behind modern LLMs. Built on attention, it revolutionized language AI and enabled the current generation of models.
What architecture actually powers GPT, Claude, and other LLMs?
The Transformer. Introduced in a 2017 paper titled "Attention Is All You Need," it became the foundation for virtually every large language model that followed.
The name is almost anticlimactic for something so consequential. But the architecture genuinely transformed the field, enabling models at scales previous approaches couldn't reach.
What made it different?
Before Transformers, the dominant architectures for language were recurrent neural networks (RNNs). These processed text sequentially, word by word, maintaining a hidden state that accumulated information.
RNNs had problems:
Sequential processing: Can't parallelize. Each word waits for the previous word.
Long-range dependencies: Information from early words fades as it passes through many steps.
Training difficulty: Gradients vanish or explode over long sequences.
The Transformer solved all three by replacing recurrence with attention. No sequential dependencies. Direct connections between any two positions. Perfectly parallelizable.
The basic structure
A Transformer stacks identical layers, each containing:
Self-attention: Every position attends to every other position
Feed-forward network: A simple neural network applied to each position independently
Layer normalization: Stabilizes training
Residual connections: Adds the input back to the output, helping gradients flow
Inside One Transformer Layer
Input + Residual        (from previous layer)
          ↓
      Layer Norm
          ↓
   Self-Attention       (every position attends to all others)
          ↓
      Layer Norm
          ↓
Feed-Forward Network    (applied to each position)
          ↓
Output + Residual       (to next layer)
Stack 12, 24, 96, or more of these layers. Each layer refines the representations, building more abstract understanding.
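Here is a minimal sketch of one such layer in PyTorch, following the pre-norm arrangement diagrammed above. The dimensions (512 model width, 8 heads, 2048 feed-forward width) and the use of nn.MultiheadAttention are illustrative assumptions, not any specific model's configuration:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One pre-norm Transformer layer: norm -> attention -> residual,
    then norm -> feed-forward -> residual."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),   # expand to a larger hidden dimension
            nn.GELU(),
            nn.Linear(d_ff, d_model),   # project back down
        )

    def forward(self, x):
        # Self-attention sub-layer with residual connection
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)   # every position attends to every other
        x = x + attn_out
        # Feed-forward sub-layer with residual connection
        x = x + self.ffn(self.norm2(x))
        return x

# A full model is just many of these stacked in sequence
layers = nn.Sequential(*[TransformerBlock() for _ in range(12)])
x = torch.randn(1, 16, 512)   # (batch, sequence length, embedding dim)
print(layers(x).shape)        # torch.Size([1, 16, 512])
```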
Why does it scale so well?
The Transformer has properties that happen to match modern hardware:
Parallelism: All positions can be processed simultaneously. GPUs thrive on parallel operations.
Regular structure: The same operations repeated many times. Easy to optimize.
Dense computation: Matrix multiplications dominate. GPUs are designed for exactly this.
RNNs require sequential steps that GPUs can't parallelize. Transformers turn language modeling into a massive parallel matrix operation. This is why training that once took months now takes weeks.
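To make the parallelism concrete, here is scaled dot-product attention written as plain matrix operations (NumPy, with toy dimensions assumed for illustration). Every position's output comes from a couple of matrix multiplications; there is no loop over the sequence anywhere:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention over ALL positions at once.
    Q, K, V: (seq_len, d) matrices."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)   # (seq_len, seq_len): every pair of positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ V              # weighted sum of values, one matmul

seq_len, d = 16, 64
Q = K = V = np.random.randn(seq_len, d)
out = attention(Q, K, V)   # all 16 positions computed simultaneously
print(out.shape)           # (16, 64)
```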
The architecture also shows clean scaling behavior. Double the parameters and you get predictable improvements. This reliability let researchers confidently invest billions in larger models.
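As a rough illustration of that predictability: Kaplan et al. (2020) fit test loss to a power law in parameter count N, roughly L(N) = (N_c / N) ** alpha. The sketch below uses their published constants but is purely illustrative, not a planning tool:

```python
# Illustrative power-law scaling of loss with parameter count,
# using the fit from Kaplan et al. (2020).
ALPHA_N = 0.076   # fitted exponent
N_C = 8.8e13      # fitted constant (non-embedding parameters)

def predicted_loss(n_params):
    return (N_C / n_params) ** ALPHA_N

for n in [1e8, 1e9, 1e10, 1e11]:
    print(f"{n:.0e} params -> predicted loss {predicted_loss(n):.3f}")
# Each 10x increase in parameters shaves off a predictable fraction of loss.
```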
The components in detail
Positional encoding: Attention is position-agnostic (it doesn't inherently know word order). Positional encodings add position information to embeddings so the model knows where each token appears.
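The original paper's sinusoidal encoding is one concrete scheme (many modern models use learned or rotary embeddings instead). A minimal version:

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Positional encodings from "Attention Is All You Need":
    PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(same)."""
    positions = np.arange(seq_len)[:, None]    # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]   # (1, d_model / 2)
    angles = positions / (10000 ** (dims / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

# Added to token embeddings so each position gets a distinct signature
pe = sinusoidal_positions(seq_len=128, d_model=512)
```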
Multi-head attention: Multiple attention mechanisms in parallel, each learning different relationship patterns. Typically 8-96 heads per layer.
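The head-splitting is mostly reshape bookkeeping, sketched below with assumed shapes: the model dimension is divided across heads, each head attends over its own lower-dimensional slice, and the results are concatenated back together.

```python
import torch

def split_heads(x, n_heads):
    """Reshape (batch, seq, d_model) -> (batch, n_heads, seq, d_head)
    so each head attends independently over a slice of the embedding."""
    batch, seq, d_model = x.shape
    d_head = d_model // n_heads
    return x.view(batch, seq, n_heads, d_head).transpose(1, 2)

x = torch.randn(2, 16, 512)          # (batch, seq, d_model)
heads = split_heads(x, n_heads=8)    # (2, 8, 16, 64): 8 heads of dim 64
# Attention runs per head in parallel; outputs are then transposed back,
# reshaped to (batch, seq, d_model), and mixed by an output projection.
print(heads.shape)
```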
Feed-forward network: A two-layer neural network with a larger hidden dimension, applied identically to each position. This is where much of the "thinking" happens.
Layer normalization: Normalizes activations to stabilize training. Applied before or after attention and feed-forward blocks depending on the variant.
Residual connections: The input to each sub-layer is added to its output. Instead of just output = f(input), you get output = f(input) + input. This lets gradients flow directly through many layers, enabling very deep networks. Layers learn refinements rather than complete transformations.
What's in a name: GPT, BERT, and friends
GPT (Generative Pre-trained Transformer): OpenAI's decoder-only models. "Generative" because they generate text.
BERT (Bidirectional Encoder Representations from Transformers): Google's encoder-only model. "Bidirectional" because attention can look both ways.
T5 (Text-to-Text Transfer Transformer): Google's encoder-decoder model that frames all tasks as text-to-text.
LLaMA, Claude, PaLM, Gemini: All Transformer variants with various modifications.
The details differ, but the core is the same: attention layers stacked deep, processing tokens in parallel.
The Transformer's legacy
Nearly every significant language model since 2018 is a Transformer or close variant. The architecture proved remarkably robust: scale it up, train it on more data, and it keeps improving.
This wasn't inevitable. Researchers tried many architectures. Most hit walls. The Transformer scaled gracefully, and that made all the difference.
Today, "LLM" almost implies "Transformer-based." The two concepts are so intertwined that understanding Transformers is understanding how modern AI thinks.