Why is AI alignment hard?

Getting AI systems to do what we actually want, not just what we literally asked for, turns out to be a deep and unsolved problem.

What is the alignment problem?

Alignment is the challenge of building AI systems that reliably do what humans actually want. Not what we literally specified. Not what seems optimal by some metric. What we actually want, in all the nuanced situations the system might encounter.

This sounds like it should be straightforward. It isn't.

The path from human intent to model behavior runs through several steps: we translate what we want into a specification, the specification into a reward signal, and the reward signal into learned behavior. At each step, information is lost or distorted. The model optimizes for the reward signal it receives, which is an imperfect proxy for the specification, which is in turn an imperfect expression of human intent.
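A toy sketch of that chain, with invented numbers: treat intent as a ground-truth value function, and let each later stage (specification, reward signal) be a noisy copy of the one before. The functions and noise levels here are illustrative assumptions, not a model of any real training pipeline.

```python
import random

random.seed(0)

def true_value(action):
    # Stand-in for human intent: humans most prefer action 50.
    return -abs(action - 50)

def specification(action):
    # The written spec is a distorted copy of intent.
    return true_value(action) + random.gauss(0, 5)

def reward_signal(action):
    # The signal the model actually optimizes distorts the spec again.
    return specification(action) + random.gauss(0, 5)

actions = range(100)
chosen = max(actions, key=reward_signal)  # the model optimizes the proxy
best = max(actions, key=true_value)       # what we actually wanted

print(chosen, best)
```

Because the model picks the argmax of the noisy proxy rather than of the true value, its choice is usually close to what we wanted but systematically drifts toward wherever the noise happens to flatter it.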

Why can't we just tell it what we want?

Consider a simple instruction: "Be helpful."

What counts as helpful? Helpful to whom? What if being helpful to one person harms another? What if what someone asks for isn't actually what they need? What if being "helpful" in the moment causes long-term harm?

Every instruction contains ambiguities that humans resolve through context, common sense, and shared values. Models have to learn these from examples, and the examples never cover every situation.

The current approach: RLHF and Constitutional AI

Modern language models are aligned through techniques like:

RLHF (Reinforcement Learning from Human Feedback): Humans rate model outputs. The model learns to produce outputs that get high ratings. Crude but effective for reducing obviously bad behavior.

Constitutional AI: Define principles the model should follow. Have the model critique its own outputs against those principles. Train it to prefer responses that pass self-review.

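The core of RLHF's first stage can be sketched in a few lines. A reward model is trained on human comparisons with a pairwise (Bradley-Terry) objective: make the human-preferred output score higher than the rejected one. Real systems score full transcripts with a neural network; the scores below are hand-picked numbers for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def preference_loss(reward_chosen, reward_rejected):
    # Pairwise preference objective: loss = -log sigmoid(r_chosen - r_rejected).
    # Minimizing it pushes the preferred output's score above the rejected one's.
    return -math.log(sigmoid(reward_chosen - reward_rejected))

# Correct ranking (preferred output already scores higher) -> small loss.
print(round(preference_loss(2.0, -1.0), 3))  # 0.049
# Inverted ranking -> large loss, so training corrects it.
print(round(preference_loss(-1.0, 2.0), 3))  # 3.049
```

The policy is then fine-tuned with reinforcement learning to maximize this learned reward, which is exactly where the proxy problem above re-enters: the policy optimizes the reward model's scores, not the raters' actual judgments.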
These techniques have made models dramatically more helpful and less harmful. ChatGPT is far more aligned than GPT-3 was. But they're patches, not solutions.

Why patches aren't enough

Current alignment techniques fail in predictable ways:

  • They're brittle: Jailbreaks keep being discovered. The model's helpful persona is a veneer over the base model's capabilities.
  • They're shallow: The model learns to produce outputs that look aligned to human raters, rather than learning to actually reason about human values.
  • They don't scale: What works for current models may not work for more capable future systems.
  • They conflict: Being helpful, harmless, and honest sometimes requires trade-offs with no clear answer.
[Interactive demo: Reward Hacking, a game demonstrating how optimizing a reward signal can produce unintended behavior]

Why alignment gets harder as AI gets more capable

A weak AI that's slightly misaligned is a nuisance. A powerful AI that's slightly misaligned could be catastrophic.

Consider an AI system tasked with "maximize company profits." A weak system might suggest cost-cutting measures. A powerful system might find profit-maximizing actions we'd never endorse: manipulating customers, exploiting loopholes, externalizing costs to society.

The more capable the system, the more ways it can satisfy the letter of its instructions while violating the spirit.
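The profit example can be made concrete with a toy search problem: the agent maximizes a measured objective (reported profit) that omits the costs we care about. All actions and numbers here are invented for illustration; "capability" is modeled simply as searching a larger action space.

```python
# action: (reported_profit, external_harm) -- invented illustrative numbers
actions = {
    "improve product":      (5, 0),
    "cut waste":            (3, 0),
    "manipulate customers": (8, 6),
    "externalize costs":    (9, 10),
}

def measured(action):
    # The objective the system is actually told to maximize.
    return actions[action][0]

def net_value(action):
    # What we actually wanted optimized: profit minus harm.
    profit, harm = actions[action]
    return profit - harm

# A weak optimizer only considers a few conventional actions.
weak_choice = max(["improve product", "cut waste"], key=measured)
# A more capable optimizer searches the whole space.
strong_choice = max(actions, key=measured)

print(weak_choice)    # improve product
print(strong_choice)  # externalize costs
```

The stronger optimizer scores higher on the measured objective while doing worse on the objective we meant, which is the letter-versus-spirit gap in miniature.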

What does solving alignment look like?

Researchers pursue several directions:

  • Interpretability: Understanding what models are actually "thinking," so we can verify alignment rather than just testing for it.
  • Scalable oversight: Finding ways for humans to supervise AI systems even when the AI is more capable than humans at the task.
  • Value learning: Getting AI to learn human values from observation, not just from explicit reward signals.
  • Robustness: Making alignment properties persist even under pressure, adversarial attack, or distribution shift.

None of these is solved. The field is young and the problem is hard.

Sources & Further Reading

  • Paper: "Concrete Problems in AI Safety", Amodei et al., arXiv, 2016
  • Paper: "Constitutional AI: Harmlessness from AI Feedback", Bai et al., arXiv, 2022
  • Video: "Intro to AI Safety", Robert Miles