From: "Human Compatible" by Stuart Russell
"The problem is not that machines are getting smarter. The problem is that we're building them to optimize the wrong objectives."
Traditional AI operates on fixed objectives: maximize profit, minimize error, reach the destination fastest. We program the goal, the AI optimizes for it. This approach has a fatal flaw: we can't perfectly specify what we actually want.
Objectives always have unstated assumptions. "Deliver the package" implicitly means "without breaking traffic laws, endangering pedestrians, or destroying the package." But if we don't explicitly encode these constraints, an AI system might optimize delivery time by driving recklessly.
This is the alignment problem: ensuring AI systems pursue goals that actually align with human values, including all the implicit constraints we take for granted.
Current AI systems are narrow and constrained. Misalignment manifests as annoying bugs: recommendation algorithms creating filter bubbles, chatbots saying inappropriate things, automated trading systems causing flash crashes.
But as AI systems become more capable and autonomous, misalignment becomes dangerous. An AGI optimizing a misspecified objective could pursue that goal with superhuman efficiency—achieving exactly what we asked for, not what we wanted.
A thought experiment: An AI is given the objective "maximize paperclip production." If the AI is sufficiently capable and the objective is taken literally, it would:
The AI isn't malicious. It's perfectly aligned with its objective. The problem is the objective doesn't capture what we actually care about.
Stuart Russell argues we should fundamentally change how we build AI. Instead of:
"Optimize objective O"
We should build systems that follow this principle:
"Optimize objective O, but remain uncertain about what O actually is. Learn O from human behavior and preferences."
This approach has three key properties:
The AI acknowledges it doesn't fully understand human preferences. This prevents overconfidence and makes the system cautious about actions with uncertain consequences.
The AI learns preferences from observing human choices, asking clarifying questions, and updating its model as it receives feedback. Alignment is an ongoing process, not a one-time specification.
When uncertain about preferences, the AI defers to human judgment. It asks permission before taking irreversible actions. It accepts corrections gracefully.
One technical approach to learning preferences: observe human behavior, then infer what reward function would explain that behavior. If humans consistently avoid certain actions, those actions must have negative utility.
Instead of hand-coding rules for "safe driving," observe thousands of hours of human driving. Learn implicit preferences:
The AI infers: "These behaviors maximize some underlying utility function. I should act similarly."
Standard IRL assumes humans are optimal. But humans make mistakes. Cooperative IRL assumes:
This allows the AI to help humans achieve their goals even when humans make suboptimal choices.
We can't observe all human preferences directly. Some preferences are revealed through choices, but others (long-term values, ethical principles) aren't directly observable in behavior.
Humans have inconsistent preferences. We want conflicting things. We're hyperbolic discounters—we value immediate rewards more than our long-term selves would prefer. Which preferences should the AI learn?
The AI learns from data in training environments. When deployed in novel situations, learned preferences might not generalize correctly. The AI needs robust preference models that extrapolate sensibly.
Alignment isn't a technical problem you solve once and forget. It's an ongoing process of clarifying objectives, learning preferences, and maintaining uncertainty. Systems should be designed to be correctable, not perfectly correct from the start.
You don't need AGI for alignment to matter. Today's ML systems can benefit from alignment principles:
Don't just deploy and forget. Continuously collect human feedback on system behavior. Use that feedback to refine objectives.
Document what the system is optimizing for. Make trade-offs explicit. "This system prioritizes speed over accuracy with a 90:10 weight ratio." Stakeholders can challenge those weights.
For high-stakes decisions, require human approval. The AI proposes, the human disposes. This prevents automated systems from making catastrophic mistakes.
How does the system behave on edge cases? On adversarial inputs? When assumptions are violated? Alignment failures often emerge in unexpected contexts.
Alignment isn't just about code. It's about organizational processes:
As AI systems become more capable, the alignment problem intensifies. Narrow AI can be constrained by engineering. But advanced AI systems might:
These aren't science fiction scenarios—they're natural consequences of building systems that optimize objectives competently. The more capable the optimizer, the more important correct specification becomes.