Glossary
RLAIF (Reinforcement Learning from AI Feedback)
A variation of RLHF where another AI model provides the preference judgments instead of human raters.
RLAIF replaces (or supplements) human raters with an AI evaluator, often a more capable model, that typically judges which of two candidate responses is better. This makes the feedback process faster and cheaper to scale, since you don’t need humans to review every example. A well-known example is Constitutional AI from Anthropic, where a model critiques and revises its own outputs according to a set of written principles. For behavior architects, RLAIF matters because it shifts some of the responsibility for behavioral shaping away from individual human raters and toward the values encoded in the evaluating model, which raises its own questions about bias, consistency, and what “good” behavior actually means.
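The labeling step described above can be sketched in a few lines of Python. This is a minimal illustration, not a real pipeline: toy_judge is a hypothetical stand-in for the AI evaluator (a keyword heuristic rather than a model call), and PreferencePair and label_preferences are names invented for this example. The point is only to show where the evaluator's judgment enters and what kind of record it produces.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class PreferencePair:
    """One AI-labeled comparison: the preferred and the rejected response."""
    prompt: str
    chosen: str
    rejected: str


def toy_judge(prompt: str, response_a: str, response_b: str) -> int:
    """Hypothetical evaluator: return 0 if response_a is preferred, else 1.

    Assumption for illustration only. A real RLAIF setup would query a
    judge model here; this heuristic just rewards polite wording.
    """
    def score(text: str) -> int:
        lowered = text.lower()
        polite = sum(phrase in lowered for phrase in ("please", "sorry", "happy to"))
        rude = sum(phrase in lowered for phrase in ("stupid", "obviously"))
        return polite - rude

    return 0 if score(response_a) >= score(response_b) else 1


def label_preferences(
    samples: List[Tuple[str, str, str]],
    judge: Callable[[str, str, str], int],
) -> List[PreferencePair]:
    """Turn (prompt, response_a, response_b) triples into preference pairs."""
    pairs = []
    for prompt, a, b in samples:
        winner = judge(prompt, a, b)
        chosen, rejected = (a, b) if winner == 0 else (b, a)
        pairs.append(PreferencePair(prompt, chosen, rejected))
    return pairs


samples = [
    (
        "How do I reset my password?",
        "Obviously you click 'Forgot password'.",
        "Happy to help: click 'Forgot password' on the sign-in page.",
    ),
]
pairs = label_preferences(samples, toy_judge)
print(pairs[0].chosen)
```

Every preference in the resulting dataset reflects whatever the judge rewards, so any bias or inconsistency in the evaluator carries straight through to the training signal. Swapping toy_judge for a different evaluator changes the labels without touching the rest of the loop.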