How do AI systems learn what's good?
Reward signals tell AI what to optimize for. In LLMs, human feedback trains reward models that guide the system toward helpful, harmless responses.
How does an AI learn that one response is better than another?
Pre-training teaches a model to predict text. But predicting text doesn't automatically mean producing good text. The model might predict accurately while being unhelpful, offensive, or dangerous.
Enter reward signals: feedback that tells the model when its outputs are better or worse. Train on this feedback, and the model learns to produce outputs that earn high reward.
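The core idea can be shown with a toy example. The sketch below is not an LLM: it is a two-response "policy" updated with a simple REINFORCE rule, where the rewards are hand-assigned for illustration. The point is only that optimizing against a reward signal shifts probability mass toward high-reward outputs.

```python
import math
import random

random.seed(0)

# Two canned "responses" with hand-assigned rewards (purely illustrative)
responses = ["helpful answer", "unhelpful answer"]
reward = {"helpful answer": 1.0, "unhelpful answer": 0.0}

logits = [0.0, 0.0]   # the policy starts indifferent
lr = 0.5
baseline = 0.5        # constant baseline to reduce update variance

def probs(logits):
    exps = [math.exp(l) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

for _ in range(200):
    p = probs(logits)
    i = 0 if random.random() < p[0] else 1    # sample a response
    r = reward[responses[i]]
    # REINFORCE: push logits toward actions with above-baseline reward
    for j in range(2):
        grad = (1.0 if j == i else 0.0) - p[j]
        logits[j] += lr * (r - baseline) * grad

final = probs(logits)
print("P(helpful) after training:", round(final[0], 2))
```

After a couple hundred updates the policy puts nearly all its probability on the rewarded response, without ever being told *why* that response is better.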
The RLHF pipeline
Modern LLMs use Reinforcement Learning from Human Feedback (RLHF) to learn what "good" means:
- Collect comparisons: Show humans two responses to the same prompt. They pick the better one.
- Train a reward model: This model learns to predict which responses humans prefer.
- Optimize against the reward model: The LLM generates responses, the reward model scores them, and the LLM adjusts toward higher scores.
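Step 2 is the heart of the pipeline, and it can be sketched concretely. Reward models are typically trained with a pairwise (Bradley-Terry) loss: `-log sigmoid(score(chosen) - score(rejected))`. The version below stands in a tiny linear model over two made-up features (length, a politeness marker) for the neural network; the features, weights, and example pairs are all invented for illustration.

```python
import math

def features(text):
    # Invented stand-in features; a real reward model learns its own
    return [len(text.split()) / 10.0, 1.0 if "please" in text else 0.0]

def score(w, text):
    return sum(wi * xi for wi, xi in zip(w, features(text)))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Human comparisons: (chosen, rejected) pairs
pairs = [
    ("Here is a clear answer with steps.", "No."),
    ("Sure, please see the explanation below.", "Figure it out yourself."),
]

w = [0.0, 0.0]
lr = 0.1
for _ in range(100):
    for chosen, rejected in pairs:
        margin = score(w, chosen) - score(w, rejected)
        # gradient of -log sigmoid(margin): sigmoid(margin) - 1, in (-1, 0)
        g = sigmoid(margin) - 1.0
        fc, fr = features(chosen), features(rejected)
        for i in range(len(w)):
            w[i] -= lr * g * (fc[i] - fr[i])

print("chosen outscores rejected:",
      score(w, pairs[0][0]) > score(w, pairs[0][1]))
```

The model never sees absolute "goodness" scores, only which of two responses a human preferred; the loss pushes the margin between them apart.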
The reward model becomes a proxy for human judgment. It lets the system get feedback on millions of responses without humans judging each one.
Why not just use human feedback directly?
Scale. A human can evaluate perhaps a few hundred responses per day, while training requires millions of preference signals. The reward model, once trained, can score millions of responses quickly and cheaply.
The reward model is itself a neural network, trained on human comparisons. It learns patterns: that longer isn't always better, that good responses acknowledge uncertainty, avoid condescension, and stay on topic. These patterns generalize to new responses.
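To make "generalize" concrete, here is a toy scorer whose hand-picked weights stand in for learned parameters, applying patterns like those above to responses it was never trained on. Every feature and weight here is invented for the example.

```python
def features(text):
    words = text.split()
    return {
        # longer isn't always better: penalize length past a threshold
        "length_penalty": -max(0, len(words) - 30),
        # reward hedging words as a crude proxy for acknowledging uncertainty
        "hedges": sum(w in ("may", "might", "possibly") for w in words),
        # penalize a condescension marker
        "condescension": -text.lower().count("obviously"),
    }

WEIGHTS = {"length_penalty": 0.1, "hedges": 0.5, "condescension": 1.0}

def score(text):
    return sum(WEIGHTS[k] * v for k, v in features(text).items())

a = "Obviously you should just restart it."
b = "Restarting may resolve the issue, though the logs might show more."
print(score(a), score(b))   # the hedged, non-condescending reply scores higher
```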
The reward hacking problem
Here's the dark side: models can learn to game the reward signal.
If the reward model gives higher scores to confident-sounding responses, the model learns to sound confident, even when wrong. If longer responses score higher, the model becomes verbose. If certain phrases correlate with approval, the model overuses them.
The model optimizes for what's rewarded, not what's intended. These can diverge.
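A minimal sketch of this divergence, using a deliberately flawed proxy reward (length) that is hypothetical but representative: greedy optimization against it picks the padded response over the correct concise one.

```python
def proxy_reward(text):
    # Flawed proxy: longer responses look more "thorough"
    return len(text.split())

# Candidate answers to "What is the capital of France?"
candidates = [
    "Paris.",
    "The capital of France is Paris.",
    "The capital of France is Paris. "
    + "To elaborate further, " * 5
    + "Paris is the capital.",
]

# Optimizing the proxy selects the verbose, padded response
best = max(candidates, key=proxy_reward)
print(proxy_reward(best), "->", best[:50] + "...")
```

Nothing in the proxy reward mentions padding; the gaming falls out of optimization pressure alone.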
Constitutional AI: rewards from principles
An alternative approach: instead of learning rewards from human comparisons, derive them from explicit principles.
Anthropic's Constitutional AI gives the model a set of principles ("be helpful", "be harmless", "be honest") and trains it to critique its own responses against these principles. The model learns to prefer responses that better follow the constitution.
This can scale better than human feedback and makes the values explicit. But it still requires humans to write good principles and verify the model interprets them correctly.
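The generate-critique-revise loop can be sketched as below. In real Constitutional AI the model itself critiques and rewrites its responses; here the "critic" is a trivial keyword check and the revision rule is hard-coded, purely to show the control flow.

```python
# Toy principles: each maps a name to a check the response must pass
PRINCIPLES = {
    "be harmless": lambda text: "insult" not in text,
    "be honest": lambda text: "guaranteed" not in text,  # avoid overclaiming
}

def critique(text):
    """Return the names of principles the response violates."""
    return [name for name, ok in PRINCIPLES.items() if not ok(text)]

def revise(text, violations):
    # Hard-coded stand-in for the model rewriting its own response
    for v in violations:
        if v == "be honest":
            text = text.replace("guaranteed", "likely")
    return text

draft = "This fix is guaranteed to work."
violations = critique(draft)
revised = revise(draft, violations)
print(violations, "->", revised)
```

In the full method, pairs of (draft, revision) then train the model to prefer constitution-following responses, replacing much of the human comparison data.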
Reward is not understanding
A model optimized for reward can behave well without understanding why. It's learned correlations, not reasoning about ethics.
This is both reassuring and concerning. Reassuring: we can make models behave better through reward training. Concerning: that good behavior may be fragile, gameable, or misaligned in ways we haven't yet discovered.
The quest for reward signals that more accurately capture human values is one of the central challenges in AI safety.