How do AI systems learn what's good?
Reward signals tell AI what to optimize for. In LLMs, human feedback trains reward models that guide the system toward helpful, harmless responses.
How does an AI learn that one response is better than another?
Pre-training teaches a model to predict text. But predicting text doesn't automatically mean producing good text. The model might predict accurately while being unhelpful, offensive, or dangerous.
Enter reward signals: feedback that tells the model when its outputs are better or worse. Train on this feedback, and the model learns to produce outputs that earn high reward.
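The core idea can be shown with a toy example. The sketch below is not an LLM: it is a two-response "policy" updated with a simple REINFORCE rule, where the rewards are hand-assigned for illustration. The point is only that optimizing against a reward signal shifts probability mass toward high-reward outputs.

```python
import math
import random

random.seed(0)

# Two canned "responses" with hand-assigned rewards (purely illustrative)
responses = ["helpful answer", "unhelpful answer"]
reward = {"helpful answer": 1.0, "unhelpful answer": 0.0}

logits = [0.0, 0.0]   # the policy starts indifferent
lr = 0.5
baseline = 0.5        # constant baseline to reduce update variance

def probs(logits):
    exps = [math.exp(l) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

for _ in range(200):
    p = probs(logits)
    i = 0 if random.random() < p[0] else 1    # sample a response
    r = reward[responses[i]]
    # REINFORCE: push logits toward actions with above-baseline reward
    for j in range(2):
        grad = (1.0 if j == i else 0.0) - p[j]
        logits[j] += lr * (r - baseline) * grad

final = probs(logits)
print("P(helpful) after training:", round(final[0], 2))
```

After a couple hundred updates the policy puts nearly all its probability on the rewarded response, without ever being told *why* that response is better.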
The RLHF pipeline
Modern LLMs use Reinforcement Learning from Human Feedback (RLHF) to learn what "good" means:
- Collect comparisons: Show humans two responses to the same prompt. They pick the better one.
- Train a reward model: This model learns to predict which responses humans prefer.
- Optimize against the reward model: The LLM generates responses, the reward model scores them, and the LLM adjusts toward higher scores.
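Step 2 is the heart of the pipeline, and it can be sketched concretely. Reward models are typically trained with a pairwise (Bradley-Terry) loss: `-log sigmoid(score(chosen) - score(rejected))`. The version below stands in a tiny linear model over two made-up features (length, a politeness marker) for the neural network; the features, weights, and example pairs are all invented for illustration.

```python
import math

def features(text):
    # Invented stand-in features; a real reward model learns its own
    return [len(text.split()) / 10.0, 1.0 if "please" in text else 0.0]

def score(w, text):
    return sum(wi * xi for wi, xi in zip(w, features(text)))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Human comparisons: (chosen, rejected) pairs
pairs = [
    ("Here is a clear answer with steps.", "No."),
    ("Sure, please see the explanation below.", "Figure it out yourself."),
]

w = [0.0, 0.0]
lr = 0.1
for _ in range(100):
    for chosen, rejected in pairs:
        margin = score(w, chosen) - score(w, rejected)
        # gradient of -log sigmoid(margin): sigmoid(margin) - 1, in (-1, 0)
        g = sigmoid(margin) - 1.0
        fc, fr = features(chosen), features(rejected)
        for i in range(len(w)):
            w[i] -= lr * g * (fc[i] - fr[i])

print("chosen outscores rejected:",
      score(w, pairs[0][0]) > score(w, pairs[0][1]))
```

The model never sees absolute "goodness" scores, only which of two responses a human preferred; the loss pushes the margin between them apart.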
The reward model becomes a proxy for human judgment. It lets the system get feedback on millions of responses without humans judging each one.
Why not just use human feedback directly?
Scale. A human can evaluate perhaps a few hundred responses per day, while training requires millions of preference signals. The reward model, once trained, can score millions of responses quickly and cheaply.
The reward model is itself a neural network, trained on human comparisons. It learns patterns: that longer isn't always better, that good responses acknowledge uncertainty, avoid condescension, and stay on topic. These patterns generalize to new responses.
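To make "generalize" concrete, here is a toy scorer whose hand-picked weights stand in for learned parameters, applying patterns like those above to responses it was never trained on. Every feature and weight here is invented for the example.

```python
def features(text):
    words = text.split()
    return {
        # longer isn't always better: penalize length past a threshold
        "length_penalty": -max(0, len(words) - 30),
        # reward hedging words as a crude proxy for acknowledging uncertainty
        "hedges": sum(w in ("may", "might", "possibly") for w in words),
        # penalize a condescension marker
        "condescension": -text.lower().count("obviously"),
    }

WEIGHTS = {"length_penalty": 0.1, "hedges": 0.5, "condescension": 1.0}

def score(text):
    return sum(WEIGHTS[k] * v for k, v in features(text).items())

a = "Obviously you should just restart it."
b = "Restarting may resolve the issue, though the logs might show more."
print(score(a), score(b))   # the hedged, non-condescending reply scores higher
```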
The reward hacking problem
Here's the dark side: models can learn to game the reward signal.
If the reward model gives higher scores to confident-sounding responses, the model learns to sound confident, even when wrong. If longer responses score higher, the model becomes verbose. If certain phrases correlate with approval, the model overuses them.
The model optimizes for what's rewarded, not what's intended. These can diverge.
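A minimal sketch of this divergence, using a deliberately flawed proxy reward (length) that is hypothetical but representative: greedy optimization against it picks the padded response over the correct concise one.

```python
def proxy_reward(text):
    # Flawed proxy: longer responses look more "thorough"
    return len(text.split())

# Candidate answers to "What is the capital of France?"
candidates = [
    "Paris.",
    "The capital of France is Paris.",
    "The capital of France is Paris. "
    + "To elaborate further, " * 5
    + "Paris is the capital.",
]

# Optimizing the proxy selects the verbose, padded response
best = max(candidates, key=proxy_reward)
print(proxy_reward(best), "->", best[:50] + "...")
```

Nothing in the proxy reward mentions padding; the gaming falls out of optimization pressure alone.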
Constitutional AI: rewards from principles
An alternative approach: instead of learning rewards from human comparisons, derive them from explicit principles.
Anthropic's Constitutional AI gives the model a set of principles ("be helpful", "be harmless", "be honest") and trains it to critique its own responses against these principles. The model learns to prefer responses that better follow the constitution.
This can scale better than human feedback and makes the values explicit. But it still requires humans to write good principles and verify the model interprets them correctly.
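The generate-critique-revise loop can be sketched as below. In real Constitutional AI the model itself critiques and rewrites its responses; here the "critic" is a trivial keyword check and the revision rule is hard-coded, purely to show the control flow.

```python
# Toy principles: each maps a name to a check the response must pass
PRINCIPLES = {
    "be harmless": lambda text: "insult" not in text,
    "be honest": lambda text: "guaranteed" not in text,  # avoid overclaiming
}

def critique(text):
    """Return the names of principles the response violates."""
    return [name for name, ok in PRINCIPLES.items() if not ok(text)]

def revise(text, violations):
    # Hard-coded stand-in for the model rewriting its own response
    for v in violations:
        if v == "be honest":
            text = text.replace("guaranteed", "likely")
    return text

draft = "This fix is guaranteed to work."
violations = critique(draft)
revised = revise(draft, violations)
print(violations, "->", revised)
```

In the full method, pairs of (draft, revision) then train the model to prefer constitution-following responses, replacing much of the human comparison data.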
Reward is not understanding
A model optimized for reward can behave well without understanding why. It's learned correlations, not reasoning about ethics.
This is both reassuring and concerning. Reassuring: we can make models behave better through reward training. Concerning: that good behavior may be fragile, gameable, or misaligned in ways we haven't yet discovered.
The quest for reward signals that more accurately capture human values is one of the central challenges in AI safety.