LLMs learn by predicting text billions of times, adjusting their parameters to make better predictions. This simple process, at massive scale, produces remarkable capabilities.
How do you teach a neural network language?
You show it text. An enormous amount of text. And you ask it, over and over: what word comes next?
The model starts with random parameters (billions of numbers that mean nothing). Its predictions are gibberish. But each wrong prediction provides feedback. The parameters adjust slightly. Over billions of examples, the adjustments accumulate into something that understands language.
This is pre-training: the massive first phase where an LLM learns the patterns of language itself.
The training loop
Training follows a simple cycle:
1. Sample a batch: get text
2. Forward pass: make predictions
3. Compute loss: how wrong were they?
4. Backward pass: find gradients
5. Update: adjust parameters
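In code, one pass through that cycle looks roughly like the sketch below. It assumes PyTorch, a `model` that maps token IDs to next-token logits, and a `dataloader` yielding batches of token IDs; these names are stand-ins for whatever a real training system uses.

```python
import torch
import torch.nn.functional as F

def pretrain(model, dataloader, lr=3e-4):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for tokens in dataloader:      # 1. Sample batch: token IDs, shape (batch, seq_len)
        inputs, targets = tokens[:, :-1], tokens[:, 1:]

        # 2. Forward pass: a probability distribution over each next token
        logits = model(inputs)     # shape (batch, seq_len - 1, vocab_size)

        # 3. Compute loss: cross-entropy against the tokens that actually came next
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            targets.reshape(-1),
        )

        # 4. Backward pass: gradient of the loss w.r.t. every parameter
        optimizer.zero_grad()
        loss.backward()

        # 5. Update: nudge each parameter slightly in the direction that reduces loss
        optimizer.step()
```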
Each iteration improves the model imperceptibly. But imperceptible improvements compound. After processing trillions of tokens, the model has learned patterns no human could articulate.
What data do they train on?
Scale requires enormous datasets:
Web crawls: Common Crawl provides billions of web pages
Books: Digital libraries and published text
Code: GitHub repositories and documentation
Conversations: Reddit, forums, dialogue datasets
Academic papers: Scientific literature across domains
Wikipedia: Encyclopedic knowledge
A typical large model trains on trillions of tokens from diverse sources. The diversity matters: it's why LLMs can discuss Shakespeare and Python, cooking and quantum physics. The model sees the full breadth of what humans write about.
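One way to picture such a mixture is as a set of sampling weights over sources. The weights below are invented placeholders for illustration, not any lab's actual recipe.

```python
import random

# Hypothetical sampling weights -- illustrative only, not a real training recipe.
data_mixture = {
    "web_crawl":       0.60,
    "books":           0.12,
    "code":            0.10,
    "conversations":   0.08,
    "academic_papers": 0.07,
    "wikipedia":       0.03,
}

def sample_source(mixture):
    """Pick which corpus the next training document is drawn from."""
    sources, weights = zip(*mixture.items())
    return random.choices(sources, weights=weights, k=1)[0]
```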
What is "loss" and why minimize it?
Loss measures how wrong the model's predictions are. For LLMs, this is typically cross-entropy loss: roughly, how surprised was the model by the actual next token?
If the model predicted "mat" with 90% probability and the actual word was "mat," loss is low. If it predicted "mat" with 1% probability, loss is high.
Training minimizes average loss across all predictions. Lower loss means the model assigns higher probability to what actually came next. It's becoming a better predictor.
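Concretely, each prediction contributes the negative log of the probability the model assigned to the true next token. A tiny worked version of the example above:

```python
import math

def token_loss(prob_of_actual_token):
    """Cross-entropy contribution of a single next-token prediction."""
    return -math.log(prob_of_actual_token)

print(token_loss(0.90))  # ~0.105 -- model was barely surprised
print(token_loss(0.01))  # ~4.6   -- model was very surprised
```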
The compute required
Training large models requires staggering resources:
GPT-3 (175B parameters): Roughly 3×10^23 floating-point operations, hundreds of GPU-years on the hardware of the time
Frontier models: Tens to hundreds of millions of dollars in compute
Training time: Weeks to months on thousands of specialized chips
Power consumption: Comparable to that of a small town
This is why only a few organizations train frontier models. The capital requirements are prohibitive. Most practitioners use pre-trained models rather than training from scratch.
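A rough feel for these numbers comes from the common back-of-envelope rule that training costs about 6 FLOPs per parameter per token. The per-GPU throughput below is an assumption for the sketch, not a measured figure.

```python
# Back-of-envelope: training FLOPs ~= 6 * parameters * tokens
params = 175e9                 # GPT-3-scale parameter count
tokens = 300e9                 # tokens seen during training (reported for GPT-3)

total_flops = 6 * params * tokens              # ~3.2e23 FLOPs

# Assumed sustained throughput per GPU (V100-era, optimistic utilization)
gpu_flops_per_second = 3e13
seconds_per_year = 365 * 24 * 3600

gpu_years = total_flops / gpu_flops_per_second / seconds_per_year
print(f"{total_flops:.1e} FLOPs, roughly {gpu_years:.0f} GPU-years at these assumptions")
# Lower real-world utilization and the many experiments that never ship
# push the actual bill well past this idealized figure.
```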
Distributed training across many machines
No single machine can train a large model. The solution: distributed training across hundreds or thousands of machines.
Data parallelism: Each machine holds a full copy of the model and processes a different batch; the resulting gradients are averaged across machines (sketched in the code below).
Model parallelism: The model itself is split across machines. Different layers live on different chips.
Pipeline parallelism: Different stages of forward and backward passes run simultaneously on different machines.
Coordinating this requires careful engineering. After each batch, machines share gradients and synchronize parameter updates. Network bandwidth becomes a bottleneck. Machines wait for each other. Techniques like gradient compression, asynchronous updates, and clever batching minimize waiting. It's as much a distributed systems problem as a machine learning one.
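To give a flavor of that engineering, here is a minimal sketch of the gradient-averaging step in data parallelism using `torch.distributed`. It assumes the process group is already initialized and every worker has finished its local backward pass.

```python
import torch.distributed as dist

def average_gradients(model):
    """After each local backward pass, average gradients across all workers."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Sum this gradient tensor across every machine, then divide,
            # so all workers apply the same averaged update.
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
```

In practice, frameworks such as PyTorch's DistributedDataParallel perform this reduction automatically and overlap it with the backward pass, which is one way the waiting described above gets hidden.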
Stages of training
Modern LLMs typically go through multiple phases:
Pre-training: Massive text prediction on diverse data. Builds general language understanding and factual knowledge.
Supervised fine-tuning (SFT): Training on curated instruction-response pairs. Teaches the model to follow instructions and produce helpful responses.
Reinforcement learning from human feedback (RLHF): Training on human preferences about response quality. Aligns the model with what users actually want.
Each stage builds on the previous. Pre-training provides the foundation. Fine-tuning shapes the interface. RLHF polishes the behavior.
Why does this work?
This is the deep mystery. Predicting next tokens sounds too simple to produce intelligence. Yet it works.
Whether this is "real" understanding remains debated. What's undeniable: the process produces capabilities that continually surprise us.