What is a Transformer?

The Transformer is the neural network architecture behind modern LLMs. Built on attention, it revolutionized language AI and enabled the current generation of models.

What architecture actually powers GPT, Claude, and other LLMs?

The Transformer. Introduced in a 2017 paper titled "Attention Is All You Need," it became the foundation for virtually every large language model that followed.

The name is almost anticlimactic for something so consequential. But the architecture genuinely transformed the field, making possible models at a scale that previous approaches could not reach.

What made it different?

Before Transformers, the dominant architectures for language were recurrent neural networks (RNNs). These processed text sequentially, word by word, maintaining a hidden state that accumulated information.

RNNs had problems:

  • Sequential processing: Can't parallelize. Each word waits for the previous word.
  • Long-range dependencies: Information from early words fades as it passes through many steps.
  • Training difficulty: Gradients vanish or explode over long sequences.

The Transformer solved all three by replacing recurrence with attention. No sequential dependencies. Direct connections between any positions. Perfectly parallelizable.
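The "direct connections between any positions" can be made concrete. Below is a minimal NumPy sketch of scaled dot-product attention, the core operation; the single-head, unmasked setup and all variable names are illustrative assumptions, not the full architecture:

```python
import numpy as np

def scaled_dot_product_attention(X, Wq, Wk, Wv):
    """Every position attends to every other position in one matrix operation."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # scores has shape (seq, seq): all position pairs are compared at once,
    # with no sequential dependency between them.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over positions
    return weights @ V

rng = np.random.default_rng(0)
seq_len, d = 4, 8
X = rng.normal(size=(seq_len, d))          # one vector per token position
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = scaled_dot_product_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8): same shape as the input
```

Note that nothing in the computation is a loop over positions: the whole sequence is handled by a few dense matrix multiplications, which is exactly what makes it parallelizable.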

The basic structure

A Transformer stacks identical layers, each containing:

  1. Self-attention: Every position attends to every other position
  2. Feed-forward network: A simple neural network applied to each position independently
  3. Layer normalization: Stabilizes training
  4. Residual connections: Adds the input back to the output, helping gradients flow

Stack 12, 24, 96, or more of these layers. Each layer refines the representations, building more abstract understanding.
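The four components above can be sketched as one layer function. This is a simplified NumPy illustration under stated assumptions (pre-norm ordering, single-head attention, ReLU feed-forward); real implementations add multiple heads, masking, and learned norm parameters:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position's vector to zero mean, unit variance."""
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def transformer_block(x, p):
    # 1. + 3. + 4. Self-attention with layer norm and a residual connection.
    h = layer_norm(x)
    Q, K, V = h @ p["Wq"], h @ p["Wk"], h @ p["Wv"]
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    x = x + w @ V                     # residual: add the input back
    # 2. + 3. + 4. Position-wise feed-forward, again with norm and residual.
    h = layer_norm(x)
    x = x + np.maximum(h @ p["W1"], 0) @ p["W2"]   # ReLU MLP per position
    return x

rng = np.random.default_rng(0)
d, d_ff, seq_len = 16, 64, 6
p = {name: rng.normal(size=shape) * 0.1
     for name, shape in [("Wq", (d, d)), ("Wk", (d, d)), ("Wv", (d, d)),
                         ("W1", (d, d_ff)), ("W2", (d_ff, d))]}
x = rng.normal(size=(seq_len, d))
for _ in range(3):                    # "stack identical layers"
    x = transformer_block(x, p)
print(x.shape)  # (6, 16)
```

The stacking loop at the end is the whole "depth" story: each pass through `transformer_block` refines the per-position representations while the shapes stay fixed.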

Why does it scale so well?

The Transformer has properties that happen to match modern hardware:

  • Parallelism: All positions can be processed simultaneously. GPUs thrive on parallel operations.
  • Regular structure: The same operations repeated many times. Easy to optimize.
  • Dense computation: Matrix multiplications dominate. GPUs are designed for exactly this.

RNNs require sequential steps that GPUs can't parallelize. Transformers turn language modeling into a massive parallel matrix operation. This is why Transformer training can be spread across thousands of GPUs in a way RNN training never could.
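The contrast is visible in code. A hypothetical minimal comparison: the RNN below is forced into a Python loop because step t depends on step t-1, while the attention version processes every position in one batch of matrix multiplications (weights and sizes here are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d = 128, 64
X = rng.normal(size=(seq_len, d))

# RNN: an inherent loop. Step t cannot start until step t-1 has finished,
# so the work cannot be spread across the sequence dimension.
Wh, Wx = rng.normal(size=(d, d)) * 0.01, rng.normal(size=(d, d)) * 0.01
h = np.zeros(d)
for t in range(seq_len):
    h = np.tanh(h @ Wh + X[t] @ Wx)   # sequential dependency on h

# Attention: dense matmuls over all positions at once. Every entry of
# `scores` and `out` can be computed independently -- ideal for GPUs.
scores = (X @ X.T) / np.sqrt(d)
w = np.exp(scores - scores.max(-1, keepdims=True))
w /= w.sum(-1, keepdims=True)
out = w @ X
print(h.shape, out.shape)  # (64,) (128, 64)
```

On real hardware the second path also maps onto the matrix-multiply units GPUs are built around, which is the "dense computation" point above.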

The architecture also shows clean scaling behavior: increase parameters, data, and compute together, and loss falls along smooth, predictable curves. This reliability let researchers confidently invest billions in larger models.

What's in a name: GPT, BERT, and friends

  • GPT (Generative Pre-trained Transformer): OpenAI's decoder-only models. "Generative" because they generate text.
  • BERT (Bidirectional Encoder Representations from Transformers): Google's encoder-only model. "Bidirectional" because attention can look both ways.
  • T5 (Text-to-Text Transfer Transformer): Google's encoder-decoder model that frames all tasks as text-to-text.
  • LLaMA, Claude, PaLM, Gemini: All Transformer variants with various modifications.

The details differ, but the core is the same: attention layers stacked deep, processing tokens in parallel.
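The "decoder-only" versus "encoder-only" distinction above mostly comes down to the attention mask. A small sketch (the matrices here are illustrative, not any model's actual configuration):

```python
import numpy as np

seq_len = 5

# GPT-style (decoder-only): a causal mask. Position i may attend only to
# positions j <= i, so the model can generate text left to right.
causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))

# BERT-style (encoder-only): no mask. Every position attends to every
# other position, "looking both ways" through the sequence.
bidirectional = np.ones((seq_len, seq_len), dtype=bool)

print(causal.astype(int))
# Each row i has ones only up to column i:
# [[1 0 0 0 0]
#  [1 1 0 0 0]
#  [1 1 1 0 0]
#  [1 1 1 1 0]
#  [1 1 1 1 1]]
```

In practice the mask is applied by setting disallowed attention scores to a large negative value before the softmax, so their weights become effectively zero.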

The Transformer's legacy

Nearly every significant language model since 2018 is a Transformer or close variant. The architecture proved remarkably robust: scale it up, train it on more data, and it keeps improving.

This wasn't inevitable. Researchers tried many architectures. Most hit walls. The Transformer scaled gracefully, and that made all the difference.

Today, "LLM" almost implies "Transformer-based." The two concepts are so intertwined that understanding Transformers is understanding how modern AI thinks.

Sources & Further Reading

📄 Paper
Attention Is All You Need
Vaswani et al. · Google · 2017
🔗 Article
The Illustrated Transformer
Jay Alammar · 2018
🎬 Video