Constitutional AI | behavior.engineering

Constitutional AI treats behavior as something that can be grounded in explicit, human-readable principles — a “constitution” — rather than inferred solely from individual human ratings. The model is trained to ask itself whether a given response violates any of these principles, then revise if it does. This makes the training process more transparent and consistent: instead of hoping that thousands of individual rater decisions happen to encode the right values, you can write down what you believe and test whether the model is actually following it. For behavior architects, it’s a concrete example of how behavioral intent can be operationalized through documented principles rather than vibes.

RLAIF (Reinforcement Learning from AI Feedback)A variation of RLHF where another AI model provides the preference judgments instead of human raters.
RLHF (Reinforcement Learning from Human Feedback)A way of training AI models by having humans rate or compare outputs, then using those ratings to reinforce better behavior over time.
Value AlignmentThe degree to which a model's behavior reflects human values, intentions, and goals rather than optimizing for narrow objectives that miss the point.
Harm AvoidanceThe practice of designing model behavior to minimize the risk of producing outputs that cause physical, psychological, social, or financial harm.
Behavioral SpecificationA written document or set of guidelines that defines how a model is expected to behave across different situations.