How are LLMs trained?

LLMs learn by predicting text billions of times, adjusting their parameters to make better predictions. This simple process, at massive scale, produces remarkable capabilities.

How do you teach a neural network language?

You show it text. An enormous amount of text. And you ask it, over and over: what word comes next?

The model starts with random parameters (billions of numbers that mean nothing). Its predictions are gibberish. But each wrong prediction provides feedback. The parameters adjust slightly. Over billions of examples, the adjustments accumulate into something that understands language.

This is pre-training: the massive first phase where an LLM learns the patterns of language itself.
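
To make this concrete, here is a tiny sketch (a toy vocabulary and plain Python, not any real tokenizer) of how a sentence turns into next-token prediction examples: every position supplies the context so far as input and the token that actually follows as the target.

```python
# Toy illustration: turn a sentence into (context -> next token) training pairs.
# The "tokenizer" here is just word splitting over a made-up vocabulary.

text = "the cat sat on the mat"
vocab = {word: i for i, word in enumerate(sorted(set(text.split())))}
tokens = [vocab[word] for word in text.split()]

# Inputs are the sequence minus its last token; targets are the sequence shifted by one.
inputs = tokens[:-1]    # "the cat sat on the"
targets = tokens[1:]    # "cat sat on the mat"

for ctx_end, target in enumerate(targets, start=1):
    context = tokens[:ctx_end]
    print(f"context={context} -> predict token {target}")
```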

The training loop

Training follows a simple cycle:

  • Sample a batch of text from the training data
  • Predict the next token at every position
  • Measure how wrong the predictions were (the loss)
  • Nudge every parameter slightly in the direction that reduces the loss
  • Repeat

Each iteration improves the model imperceptibly. But imperceptible improvements compound. After processing trillions of tokens, the model has learned patterns no human could articulate.
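
As a sketch of what one pass through that cycle looks like in code, here is a minimal training loop in PyTorch. The model is a toy bigram table (each token's row holds logits for the next token) and the data is random placeholder tokens; a real run swaps in a transformer and trillions of real tokens, but the loop itself is the same.

```python
import torch
import torch.nn.functional as F

vocab_size = 100
# Toy model: an embedding table whose rows are next-token logits.
model = torch.nn.Embedding(vocab_size, vocab_size)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

for step in range(100):
    # 1. Sample a batch of token sequences (random placeholders here).
    batch = torch.randint(0, vocab_size, (32, 64))
    inputs, targets = batch[:, :-1], batch[:, 1:]

    # 2. Predict a distribution over the next token at every position.
    logits = model(inputs)                                   # (batch, seq, vocab)

    # 3. Measure how wrong the predictions were (cross-entropy loss).
    loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))

    # 4. Backpropagate and nudge every parameter to reduce the loss.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```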

What data do they train on?

Scale requires enormous datasets:

  • Web crawls: Common Crawl provides billions of web pages
  • Books: Digital libraries and published text
  • Code: GitHub repositories and documentation
  • Conversations: Reddit, forums, dialogue datasets
  • Academic papers: Scientific literature across domains
  • Wikipedia: Encyclopedic knowledge

A typical large model trains on trillions of tokens from diverse sources. The diversity matters: it's why LLMs can discuss Shakespeare and Python, cooking and quantum physics. The model sees the full breadth of what humans write about.
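
As an illustration of how such a blend might be expressed, here is a hypothetical set of sampling weights over sources. The numbers are invented for the example, not any particular model's recipe, which is usually unpublished.

```python
import random

# Hypothetical data mixture: the weights below are illustrative, not a real recipe.
data_mixture = {
    "web_crawl": 0.55,
    "books": 0.15,
    "code": 0.10,
    "conversations": 0.08,
    "academic_papers": 0.07,
    "wikipedia": 0.05,
}

def sample_source(mixture):
    """Pick the source of the next training document, proportional to its weight."""
    sources, weights = zip(*mixture.items())
    return random.choices(sources, weights=weights, k=1)[0]

print(sample_source(data_mixture))
```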

What is "loss" and why minimize it?

Loss measures how wrong the model's predictions are. For LLMs, this is typically cross-entropy loss: roughly, how surprised was the model by the actual next token? Formally, it is the negative log of the probability the model assigned to that token.

If the model predicted "mat" with 90% probability and the actual word was "mat," loss is low. If it predicted "mat" with 1% probability, loss is high.

Training minimizes average loss across all predictions. Lower loss means the model assigns higher probability to what actually came next. It's becoming a better predictor.
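
The "mat" example works out to a simple negative logarithm; a quick check in Python:

```python
import math

# Cross-entropy for a single prediction is -log(p), where p is the probability
# the model assigned to the token that actually appeared.
confident = 0.90   # model put 90% on "mat"
unsure = 0.01      # model put only 1% on "mat"

print(f"loss when confident: {-math.log(confident):.3f}")   # ~0.105
print(f"loss when unsure:    {-math.log(unsure):.3f}")      # ~4.605
```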

The compute required

Training large models requires staggering resources:

  • GPT-3 (175B parameters): Roughly 3×10²³ floating-point operations, hundreds to thousands of GPU-years of computation
  • Frontier models: Tens to hundreds of millions of dollars in compute
  • Training time: Weeks to months on thousands of specialized chips
  • Power consumption: Megawatts of sustained draw, on the order of a small town

This is why only a few organizations train frontier models. The capital requirements are prohibitive. Most practitioners use pre-trained models rather than training from scratch.
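
A back-of-envelope calculation shows where numbers like these come from. A widely used approximation puts training compute at roughly 6 × parameters × tokens; the per-GPU throughput below is an assumed sustained rate, chosen only to make the arithmetic concrete.

```python
# Rough estimate using the common approximation: FLOPs ≈ 6 × parameters × tokens.
params = 175e9            # GPT-3-scale parameter count
tokens = 300e9            # roughly the reported GPT-3 training token count
total_flops = 6 * params * tokens                    # ~3.2e23 FLOPs

gpu_flops_per_sec = 100e12                           # assumed ~100 TFLOP/s sustained per GPU
gpu_seconds = total_flops / gpu_flops_per_sec
gpu_years = gpu_seconds / (365 * 24 * 3600)

print(f"total compute: {total_flops:.2e} FLOPs")
print(f"single-GPU time at the assumed rate: ~{gpu_years:.0f} GPU-years")
```

Spread across thousands of GPUs, that still translates to weeks or months of wall-clock time.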

Stages of training

Modern LLMs typically go through multiple phases:

Pre-training: Massive text prediction on diverse data. Builds general language understanding and factual knowledge.

Supervised fine-tuning (SFT): Training on curated instruction-response pairs. Teaches the model to follow instructions and produce helpful responses.

Reinforcement learning from human feedback (RLHF): Training on human preferences about response quality. Aligns the model with what users actually want.

Each stage builds on the previous. Pre-training provides the foundation. Fine-tuning shapes the interface. RLHF polishes the behavior.
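
To make the fine-tuning stages concrete, here is a hypothetical instruction-response record of the kind used for SFT. The field names and chat formatting are illustrative, not a specific dataset's schema.

```python
# Hypothetical SFT record: an instruction paired with a curated response.
sft_example = {
    "instruction": "Explain what cross-entropy loss measures, in one sentence.",
    "response": (
        "Cross-entropy loss measures how much probability the model assigned "
        "to the token that actually came next; lower is better."
    ),
}

# During SFT the pair is flattened into one training sequence, and loss is
# typically computed only on the response tokens.
training_text = f"User: {sft_example['instruction']}\nAssistant: {sft_example['response']}"
print(training_text)
```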

Why does this work?

This is the deep mystery. Predicting next tokens sounds too simple to produce intelligence. Yet it works.

Whether this is "real" understanding remains debated. What's undeniable: the process produces capabilities that continually surprise us.

Sources & Further Reading

📄 Paper
Language Models are Few-Shot Learners
Brown et al. · OpenAI · 2020
🎬 Video
Let's build GPT: from scratch, in code
Andrej Karpathy · 2023