Glossary
Alignment Science
The research field focused on developing methods, theories, and techniques for making AI systems reliably pursue intended goals and values.
Alignment science is the research side of the broader alignment problem. It develops the theoretical foundations and empirical methods needed to make AI systems do what their designers intend, even as those systems become more capable and are applied to harder tasks. Its main areas include reward modeling, interpretability (understanding what is happening inside a model), scalable oversight (how humans can supervise AI systems that are more capable than they are), and formal approaches to specifying goals. For behavior architects, alignment science is the upstream research that shapes the tools and techniques available to them. Methods such as reinforcement learning from human feedback (RLHF) and Constitutional AI originated in alignment research and have since become practical tools that practitioners use routinely.