
AI Alignment: Building Systems That Do What We Mean

From: "Human Compatible" by Stuart Russell

"The problem is not that machines are getting smarter. The problem is that we're building them to optimize the wrong objectives."

The Core Problem

Traditional AI operates on fixed objectives: maximize profit, minimize error, reach the destination fastest. We program the goal, the AI optimizes for it. This approach has a fatal flaw: we can't perfectly specify what we actually want.

Objectives always have unstated assumptions. "Deliver the package" implicitly means "without breaking traffic laws, endangering pedestrians, or destroying the package." But if we don't explicitly encode these constraints, an AI system might optimize delivery time by driving recklessly.

This is the alignment problem: ensuring AI systems pursue goals that actually align with human values, including all the implicit constraints we take for granted.
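
To make the delivery example concrete, here is a minimal sketch in Python (the plans, penalty weights, and all numbers are made-up for illustration): the objective we wrote down only measures delivery time, while the constraints we actually care about are left implicit, so the optimizer picks the reckless plan.

    # Hypothetical delivery planner: the "official" objective only measures
    # delivery time; traffic laws, pedestrian safety, and package integrity
    # are implicit constraints that were never encoded.

    def naive_objective(plan):
        """What we wrote down: faster is better."""
        return -plan["delivery_minutes"]

    def intended_objective(plan):
        """What we actually meant: fast, but not at any cost."""
        score = -plan["delivery_minutes"]
        score -= 1000 * plan["traffic_violations"]       # implicit constraint
        score -= 10000 * plan["pedestrians_endangered"]  # implicit constraint
        score -= 500 * plan["package_damage"]            # implicit constraint
        return score

    plans = [
        {"delivery_minutes": 30, "traffic_violations": 0,
         "pedestrians_endangered": 0, "package_damage": 0},
        {"delivery_minutes": 12, "traffic_violations": 4,
         "pedestrians_endangered": 1, "package_damage": 1},
    ]

    # The optimizer happily picks the reckless plan under the naive objective.
    print(max(plans, key=naive_objective)["delivery_minutes"])     # 12 (reckless)
    print(max(plans, key=intended_objective)["delivery_minutes"])  # 30 (what we meant)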

Why Alignment Matters Now

Current AI systems are narrow and constrained. Misalignment manifests as annoying bugs: recommendation algorithms creating filter bubbles, chatbots saying inappropriate things, automated trading systems causing flash crashes.

But as AI systems become more capable and autonomous, misalignment becomes dangerous. An AGI optimizing a misspecified objective could pursue that goal with superhuman efficiency—achieving exactly what we asked for, not what we wanted.

The Paperclip Maximizer

A thought experiment: an AI is given the objective "maximize paperclip production." If the AI is sufficiently capable and the objective is taken literally, it would convert every available resource into paperclips, acquire more resources so it can make still more, and resist any attempt to switch it off, since being switched off means fewer paperclips.

The AI isn't malicious. It's perfectly aligned with its objective. The problem is the objective doesn't capture what we actually care about.

Russell's Proposed Solution

Stuart Russell argues we should fundamentally change how we build AI. Instead of:

"Optimize objective O"

We should build systems that follow this principle:

"Optimize objective O, but remain uncertain about what O actually is. Learn O from human behavior and preferences."

This approach has three key properties:

1. Explicit Uncertainty

The AI acknowledges it doesn't fully understand human preferences. This prevents overconfidence and makes the system cautious about actions with uncertain consequences.

2. Active Learning

The AI learns preferences from observing human choices, asking clarifying questions, and updating its model as it receives feedback. Alignment is an ongoing process, not a one-time specification.

3. Deference to Humans

When uncertain about preferences, the AI defers to human judgment. It asks permission before taking irreversible actions. It accepts corrections gracefully.
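
The sketch below, in Python with toy numbers I've invented, shows how these three properties might fit together: the agent keeps a weighted set of candidate objectives rather than a single fixed one, defers to a human when the candidates disagree sharply about an action, and reweights the candidates as feedback arrives. It is an illustration of the principle, not Russell's actual algorithm.

    import numpy as np

    # Toy illustration of the three properties. The candidate objectives and
    # their utilities are invented; a real system would learn them.
    candidate_utils = {
        "speed_only":   {"drive_fast": 1.0,  "drive_safely": 0.4, "wait": 0.0},
        "safety_first": {"drive_fast": -2.0, "drive_safely": 0.8, "wait": 0.2},
        "balanced":     {"drive_fast": 0.1,  "drive_safely": 0.7, "wait": 0.1},
    }
    posterior = {name: 1 / 3 for name in candidate_utils}  # 1. explicit uncertainty

    def choose(actions, disagreement_threshold=1.0):
        """Pick the best action in expectation, or defer if candidates disagree."""
        best, best_value = None, float("-inf")
        for a in actions:
            values = np.array([candidate_utils[c][a] for c in posterior])
            weights = np.array(list(posterior.values()))
            if values.max() - values.min() > disagreement_threshold:
                continue  # 3. deference: too contested, leave it to the human
            expected = float(weights @ values)
            if expected > best_value:
                best, best_value = a, expected
        return best if best is not None else "ask_human"

    def update_from_feedback(preferred, rejected):
        """2. active learning: upweight candidates that agree with the human."""
        for c in posterior:
            if candidate_utils[c][preferred] > candidate_utils[c][rejected]:
                posterior[c] *= 2.0
        total = sum(posterior.values())
        for c in posterior:
            posterior[c] /= total

    print(choose(["drive_fast", "drive_safely", "wait"]))  # "drive_safely"
    update_from_feedback(preferred="wait", rejected="drive_fast")
    print(max(posterior, key=posterior.get))               # "safety_first"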

Inverse Reinforcement Learning

One technical approach to learning preferences: observe human behavior, then infer what reward function would explain that behavior. If humans consistently avoid certain actions, those actions must have negative utility.
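
One simple way to make this concrete is a Bayesian-flavored sketch (my own toy construction, not a production IRL algorithm): assume the human chooses actions roughly in proportion to exp(reward), observe their choices, and see which candidate reward function explains the behavior best. The actions, candidate rewards, and observations below are invented for illustration.

    import numpy as np

    # Toy setup: three observable actions and three candidate reward functions
    # that might explain the human's behavior (all values invented).
    actions = ["run_red_light", "brake_smoothly", "tailgate"]
    candidates = {
        "time_only":   np.array([0.9, 0.2, 0.7]),
        "safety_only": np.array([-3.0, 1.0, -1.5]),
        "mixed":       np.array([-1.0, 0.8, -0.5]),
    }
    posterior = {name: 1 / 3 for name in candidates}

    def choice_likelihood(action_idx, reward):
        """Boltzmann-rational choice model: P(action) proportional to exp(reward)."""
        exp_r = np.exp(reward)
        return exp_r[action_idx] / exp_r.sum()

    # Observed behavior: the human keeps braking smoothly and never runs lights.
    observed = ["brake_smoothly", "brake_smoothly", "brake_smoothly", "tailgate"]

    for obs in observed:
        idx = actions.index(obs)
        for name, reward in candidates.items():
            posterior[name] *= choice_likelihood(idx, reward)
    total = sum(posterior.values())
    posterior = {name: float(p / total) for name, p in posterior.items()}

    print(posterior)  # probability mass shifts toward the safety-weighted rewards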

Self-Driving Cars

Instead of hand-coding rules for "safe driving," observe thousands of hours of human driving and learn the implicit preferences that behavior reveals: keeping a safe following distance, slowing down near pedestrians and cyclists, trading a little speed for a lot of safety margin.

The AI infers: "These behaviors maximize some underlying utility function. I should act similarly."

Cooperative Inverse RL

Standard IRL assumes humans behave optimally. But humans make mistakes. Cooperative IRL instead frames the interaction as a cooperative game: the human knows (at least implicitly) what they value, the AI does not, both act to maximize the human's reward, and the human's behavior is treated as informative evidence rather than a perfect demonstration.

This allows the AI to help humans achieve their goals even when humans make suboptimal choices.
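
A minimal sketch of that idea, with made-up numbers: the human is modelled as noisily rational, so the assistant does not simply imitate the human's last choice; it acts to maximize the human's inferred reward under its current posterior (hard-coded here for brevity).

    import numpy as np

    actions = ["take_highway", "take_shortcut_through_school_zone"]

    # Posterior over what the human values, e.g. inferred as in the sketch above
    # (numbers hard-coded here for brevity).
    posterior = {"safety_weighted": 0.8, "time_weighted": 0.2}
    rewards = {
        "safety_weighted": np.array([0.9, -0.5]),
        "time_weighted":   np.array([0.3, 0.8]),
    }

    # The human, in a rush, just took the school-zone shortcut: a suboptimal
    # choice under their own long-run preferences.
    last_human_choice = "take_shortcut_through_school_zone"

    # The assistant helps with the inferred goal rather than copying the mistake.
    expected = sum(posterior[c] * rewards[c] for c in posterior)
    best = actions[int(np.argmax(expected))]

    print("imitate:", last_human_choice)
    print("assist: ", best)  # "take_highway"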

Value Learning Challenges

Partial Observability

We can't observe all human preferences directly. Some preferences are revealed through choices, but others (long-term values, ethical principles) aren't directly observable in behavior.

Preference Inconsistency

Humans have inconsistent preferences. We want conflicting things. We're hyperbolic discounters—we value immediate rewards more than our long-term selves would prefer. Which preferences should the AI learn?
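
A small worked example of that inconsistency, using hyperbolic discounting with arbitrary amounts and a made-up discount rate k: the same person prefers the larger-later reward from a distance, then flips to the smaller-sooner one as it becomes imminent. Which of those two revealed preferences should the AI learn?

    # Hyperbolic discounting: value = amount / (1 + k * delay). The amounts,
    # dates, and k are arbitrary; the preference reversal is the point.
    def hyperbolic_value(amount, delay_days, k=0.1):
        return amount / (1 + k * delay_days)

    def preferred(today):
        # Option A: $50 on day 30.  Option B: $80 on day 40.
        value_a = hyperbolic_value(50, 30 - today)
        value_b = hyperbolic_value(80, 40 - today)
        return "A ($50 sooner)" if value_a > value_b else "B ($80 later)"

    print(preferred(today=0))   # B ($80 later): patient when both are far away
    print(preferred(today=30))  # A ($50 sooner): impulsive once A is immediate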

Distribution Shift

The AI learns from data in training environments. When deployed in novel situations, learned preferences might not generalize correctly. The AI needs robust preference models that extrapolate sensibly.
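
One pragmatic (and admittedly partial) check, sketched here with synthetic data: before trusting a learned preference model in deployment, compare the deployment inputs against the training distribution and flag features that have drifted far from what the model saw.

    import numpy as np

    # Synthetic stand-ins for training inputs and deployment inputs; in the
    # deployment data, feature 1 has drifted well away from training.
    rng = np.random.default_rng(0)
    train_features = rng.normal(loc=0.0, scale=1.0, size=(5000, 3))
    deploy_features = rng.normal(loc=[0.0, 2.5, 0.0], scale=1.0, size=(500, 3))

    def drifted_features(train, deploy, z_threshold=4.0):
        """Flag features whose deployment mean is far from the training mean."""
        mu, sigma = train.mean(axis=0), train.std(axis=0)
        z = np.abs(deploy.mean(axis=0) - mu) / (sigma / np.sqrt(len(deploy)))
        return np.where(z > z_threshold)[0]

    # If anything is flagged, the learned preference model is being asked about
    # situations it never saw; treat its outputs with extra caution.
    print(drifted_features(train_features, deploy_features))  # [1]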

Key Insight

Alignment isn't a technical problem you solve once and forget. It's an ongoing process of clarifying objectives, learning preferences, and maintaining uncertainty. Systems should be designed to be correctable, not perfectly correct from the start.

Practical Alignment for Current Systems

You don't need AGI for alignment to matter. Today's ML systems can benefit from alignment principles:

Build in Feedback Loops

Don't just deploy and forget. Continuously collect human feedback on system behavior. Use that feedback to refine objectives.
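
As a sketch of what such a loop can look like, here is a toy linear preference model updated from pairwise human feedback with a logistic (Bradley-Terry style) gradient step; the features and feedback are simulated, and a real system would use richer models and proper logging.

    import numpy as np

    weights = np.zeros(3)  # linear preference model over three output features

    def update(preferred, rejected, lr=0.1):
        """One gradient step on -log sigmoid(score(preferred) - score(rejected))."""
        global weights
        diff = np.asarray(preferred) - np.asarray(rejected)
        p_correct = 1.0 / (1.0 + np.exp(-weights @ diff))
        weights += lr * (1.0 - p_correct) * diff

    # Simulated feedback: reviewers keep preferring outputs with more of
    # feature 2 (say, "cites sources") over outputs that are merely longer
    # (feature 0).
    feedback = [([0.2, 0.1, 0.9], [0.8, 0.1, 0.1])] * 50
    for better, worse in feedback:
        update(better, worse)

    print(weights)  # feature 2 ends up with the largest positive weight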

Make Objectives Auditable

Document what the system is optimizing for. Make trade-offs explicit. "This system prioritizes speed over accuracy with a 90:10 weight ratio." Stakeholders can challenge those weights.
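
One lightweight way to do this, sketched below with illustrative field names and numbers, is to make the objective itself a reviewable artifact: a small spec that lives in version control, so the trade-offs can be diffed, challenged, and re-reviewed.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class ObjectiveSpec:
        description: str
        weights: dict           # explicit trade-offs, open to challenge
        excluded_signals: list  # things the system must NOT optimize for
        owner: str              # who answers for these choices
        last_reviewed: str

    ROUTING_OBJECTIVE = ObjectiveSpec(
        description="Route requests quickly without sacrificing accuracy.",
        weights={"speed": 0.9, "accuracy": 0.1},  # the 90:10 ratio from the text
        excluded_signals=["time_on_site", "outrage_engagement"],
        owner="ml-platform-team",
        last_reviewed="2024-01-15",
    )

    # The spec lives in version control, so the weights can be diffed and challenged.
    print(ROUTING_OBJECTIVE.weights)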

Include Human-in-the-Loop

For high-stakes decisions, require human approval. The AI proposes, the human disposes. This prevents automated systems from making catastrophic mistakes.
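
A minimal sketch of such a gate, with placeholder risk scoring and a stand-in approval callback: anything scored as hard to reverse is blocked until a human signs off.

    HIGH_STAKES_THRESHOLD = 0.7

    def risk_score(action: dict) -> float:
        """Placeholder: estimate how costly the action would be to reverse."""
        return action.get("estimated_irreversibility", 1.0)  # unknown => cautious

    def execute(action: dict, do_it, ask_human):
        """The AI proposes; a human disposes when the stakes are high."""
        if risk_score(action) >= HIGH_STAKES_THRESHOLD and not ask_human(action):
            return "blocked: awaiting human approval"
        return do_it(action)

    result = execute(
        {"name": "delete_customer_records", "estimated_irreversibility": 0.95},
        do_it=lambda a: f"executed {a['name']}",
        ask_human=lambda a: False,  # in production: route to a review queue
    )
    print(result)  # blocked: awaiting human approval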

Test for Distributional Shift

How does the system behave on edge cases? On adversarial inputs? When assumptions are violated? Alignment failures often emerge in unexpected contexts.
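
One way to probe for this, sketched with a toy policy and invented edge cases: write down behavioral invariants that should hold no matter how strange the input is, then assert them on inputs the training data never covered.

    def throttle_policy(queue_length: int, latency_ms: float) -> float:
        """Toy decision policy: what fraction of traffic to admit."""
        return max(0.0, min(1.0, 1.0 - 0.001 * latency_ms - 0.01 * queue_length))

    edge_cases = [
        {"queue_length": 0, "latency_ms": 0.0},           # idle system
        {"queue_length": 10**6, "latency_ms": 5.0},       # absurd backlog
        {"queue_length": 5, "latency_ms": float("inf")},  # broken latency probe
        {"queue_length": -1, "latency_ms": 50.0},         # corrupted input
    ]

    for case in edge_cases:
        admit = throttle_policy(**case)
        # Invariants that should hold no matter how strange the input is.
        assert 0.0 <= admit <= 1.0, f"admit fraction out of range for {case}"
    print("all edge-case invariants held")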

Organizational Alignment

Alignment isn't just about code. It's also about organizational processes: who decides what a system optimizes for, who reviews those objectives, and how concerns about misaligned behavior get raised, escalated, and acted on.

The Long-Term Challenge

As AI systems become more capable, the alignment problem intensifies. Narrow AI can be constrained by engineering. But advanced AI systems might resist being switched off (shutdown prevents achieving the objective), acquire resources and influence beyond anything we intended, or exploit loopholes in whatever constraints we did manage to specify.

These aren't science fiction scenarios—they're natural consequences of building systems that optimize objectives competently. The more capable the optimizer, the more important correct specification becomes.

Open Problems

Much remains unresolved: how to aggregate inconsistent or conflicting human preferences, how to make learned values generalize beyond the training distribution, and how to keep increasingly capable systems deferential and correctable as they improve.

Actionable Principles

  1. Design for uncertainty – Systems should know what they don't know
  2. Learn from feedback – Objectives should update based on human input
  3. Maintain human control – Humans should be able to override or correct
  4. Test for misalignment – Adversarially probe for edge cases
  5. Document objectives – Make implicit assumptions explicit

Further Reading

Stuart Russell, Human Compatible: Artificial Intelligence and the Problem of Control (2019), the book this article draws on.


