LLMs learn by predicting text billions of times, adjusting their parameters to make better predictions. This simple process, at massive scale, produces remarkable capabilities.
How do you teach a neural network language?
You show it text. An enormous amount of text. And you ask it, over and over: what word comes next?
The model starts with random parameters (billions of numbers that mean nothing). Its predictions are gibberish. But each wrong prediction provides feedback. The parameters adjust slightly. Over billions of examples, the adjustments accumulate into something that understands language.
This is pre-training: the massive first phase where an LLM learns the patterns of language itself.
The training loop
Training follows a simple cycle:
1. Sample a batch: get text
2. Forward pass: make predictions
3. Compute loss: how wrong were they?
4. Backward pass: find gradients
5. Update: adjust parameters
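In code, one pass through that cycle looks roughly like the sketch below. It assumes PyTorch, a `model` that maps token IDs to next-token logits, and a `dataloader` yielding batches of token IDs; these names are stand-ins for whatever a real training system uses.

```python
import torch
import torch.nn.functional as F

def pretrain(model, dataloader, lr=3e-4):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for tokens in dataloader:      # 1. Sample batch: token IDs, shape (batch, seq_len)
        inputs, targets = tokens[:, :-1], tokens[:, 1:]

        # 2. Forward pass: a probability distribution over each next token
        logits = model(inputs)     # shape (batch, seq_len - 1, vocab_size)

        # 3. Compute loss: cross-entropy against the tokens that actually came next
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            targets.reshape(-1),
        )

        # 4. Backward pass: gradient of the loss w.r.t. every parameter
        optimizer.zero_grad()
        loss.backward()

        # 5. Update: nudge each parameter slightly in the direction that reduces loss
        optimizer.step()
```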
Each iteration improves the model imperceptibly. But imperceptible improvements compound. After processing trillions of tokens, the model has learned patterns no human could articulate.
What data do they train on?
Scale requires enormous datasets:
Web crawls: Common Crawl provides billions of web pages
Books: Digital libraries and published text
Code: GitHub repositories and documentation
Conversations: Reddit, forums, dialogue datasets
Academic papers: Scientific literature across domains
Wikipedia: Encyclopedic knowledge
A typical large model trains on trillions of tokens from diverse sources. The diversity matters: it's why LLMs can discuss Shakespeare and Python, cooking and quantum physics. The model sees the full breadth of what humans write about.
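One way to picture such a mixture is as a set of sampling weights over sources. The weights below are invented placeholders for illustration, not any lab's actual recipe.

```python
import random

# Hypothetical sampling weights -- illustrative only, not a real training recipe.
data_mixture = {
    "web_crawl":       0.60,
    "books":           0.12,
    "code":            0.10,
    "conversations":   0.08,
    "academic_papers": 0.07,
    "wikipedia":       0.03,
}

def sample_source(mixture):
    """Pick which corpus the next training document is drawn from."""
    sources, weights = zip(*mixture.items())
    return random.choices(sources, weights=weights, k=1)[0]
```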
What is "loss" and why minimize it?
Loss measures how wrong the model's predictions are. For LLMs, this is typically cross-entropy loss: roughly, how surprised was the model by the actual next token?
If the model predicted "mat" with 90% probability and the actual word was "mat," loss is low. If it predicted "mat" with 1% probability, loss is high.
Training minimizes average loss across all predictions. Lower loss means the model assigns higher probability to what actually came next. It's becoming a better predictor.
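Concretely, each prediction contributes the negative log of the probability the model assigned to the true next token. A tiny worked version of the example above:

```python
import math

def token_loss(prob_of_actual_token):
    """Cross-entropy contribution of a single next-token prediction."""
    return -math.log(prob_of_actual_token)

print(token_loss(0.90))  # ~0.105 -- model was barely surprised
print(token_loss(0.01))  # ~4.6   -- model was very surprised
```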
The compute required
Training large models requires staggering resources:
GPT-3 (175B parameters): Roughly 3×10^23 floating-point operations, hundreds of GPU-years on the hardware of the time
Frontier models: Tens to hundreds of millions of dollars in compute
Training time: Weeks to months on thousands of specialized chips
Power consumption: Comparable to that of a small town
This is why only a few organizations train frontier models. The capital requirements are prohibitive. Most practitioners use pre-trained models rather than training from scratch.
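A rough feel for these numbers comes from the common back-of-envelope rule that training costs about 6 FLOPs per parameter per token. The per-GPU throughput below is an assumption for the sketch, not a measured figure.

```python
# Back-of-envelope: training FLOPs ~= 6 * parameters * tokens
params = 175e9                 # GPT-3-scale parameter count
tokens = 300e9                 # tokens seen during training (reported for GPT-3)

total_flops = 6 * params * tokens              # ~3.2e23 FLOPs

# Assumed sustained throughput per GPU (V100-era, optimistic utilization)
gpu_flops_per_second = 3e13
seconds_per_year = 365 * 24 * 3600

gpu_years = total_flops / gpu_flops_per_second / seconds_per_year
print(f"{total_flops:.1e} FLOPs, roughly {gpu_years:.0f} GPU-years at these assumptions")
# Lower real-world utilization and the many experiments that never ship
# push the actual bill well past this idealized figure.
```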
Distributed training across many machines
No single machine can train a large model. The solution: distributed training across hundreds or thousands of machines.
Data parallelism: Each machine holds a full copy of the model and processes a different batch; the resulting gradients are averaged across machines (sketched in the code below).
Model parallelism: The model itself is split across machines. Different layers live on different chips.
Pipeline parallelism: Different stages of forward and backward passes run simultaneously on different machines.
Coordinating this requires careful engineering. After each batch, machines share gradients and synchronize parameter updates. Network bandwidth becomes a bottleneck. Machines wait for each other. Techniques like gradient compression, asynchronous updates, and clever batching minimize waiting. It's as much a distributed systems problem as a machine learning one.
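To give a flavor of that engineering, here is a minimal sketch of the gradient-averaging step in data parallelism using `torch.distributed`. It assumes the process group is already initialized and every worker has finished its local backward pass.

```python
import torch.distributed as dist

def average_gradients(model):
    """After each local backward pass, average gradients across all workers."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Sum this gradient tensor across every machine, then divide,
            # so all workers apply the same averaged update.
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
```

In practice, frameworks such as PyTorch's DistributedDataParallel perform this reduction automatically and overlap it with the backward pass, which is one way the waiting described above gets hidden.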
Stages of training
Modern LLMs typically go through multiple phases:
Pre-training: Massive text prediction on diverse data. Builds general language understanding and factual knowledge.
Supervised fine-tuning (SFT): Training on curated instruction-response pairs. Teaches the model to follow instructions and produce helpful responses.
Reinforcement learning from human feedback (RLHF): Training on human preferences about response quality. Aligns the model with what users actually want.
Each stage builds on the previous. Pre-training provides the foundation. Fine-tuning shapes the interface. RLHF polishes the behavior.
Why does this work?
This is the deep mystery. Predicting next tokens sounds too simple to produce intelligence. Yet it works.
Whether this is "real" understanding remains debated. What's undeniable: the process produces capabilities that continually surprise us.