Preference Learning | behavior.engineering

Instead of telling a model “this is the correct answer,” preference learning trains it on pairs of responses, each labeled with the one that humans (or another model) preferred. This is often more natural than labeling from scratch: it’s easier for people to say “this response is better than that one” than to write the ideal response themselves. With enough comparisons, the model learns to favor outputs that resemble the preferred ones and to move away from those that resemble the rejected ones. Preference learning is a core mechanism in reinforcement learning from human feedback (RLHF) and shapes many of the behavioral qualities users notice in modern AI assistants, from tone and length to how cautiously or openly a model handles sensitive topics.
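To make the idea of learning from pairs concrete, here is a minimal sketch of one common formulation, a Bradley-Terry style pairwise loss. The text above does not commit to a specific method, so treat this as an illustrative assumption: each response gets a single numeric score, and the loss is small when the preferred response scores higher than the other one. The function names, the three response labels, and the training settings are all hypothetical. In practice the score would come from a neural network reading the response text, not from a lookup table.

```python
import math


def preference_probability(score_a: float, score_b: float) -> float:
    """Modeled probability that response A is preferred over response B."""
    return 1.0 / (1.0 + math.exp(score_b - score_a))


def pairwise_loss(score_preferred: float, score_other: float) -> float:
    """Negative log-likelihood of the observed preference."""
    return -math.log(preference_probability(score_preferred, score_other))


def fit_scores(items, comparisons, steps=500, lr=0.1):
    """Learn one score per item from (preferred, other) pairs by gradient descent."""
    scores = {item: 0.0 for item in items}
    for _ in range(steps):
        for preferred, other in comparisons:
            p = preference_probability(scores[preferred], scores[other])
            # The loss gradient is -(1 - p) for the preferred score and
            # +(1 - p) for the other score, so descending pushes them apart.
            scores[preferred] += lr * (1.0 - p)
            scores[other] -= lr * (1.0 - p)
    return scores


# Hypothetical comparison data: each tuple is (preferred, other).
comparisons = [
    ("concise", "rambling"),
    ("concise", "curt"),
    ("curt", "rambling"),
]
scores = fit_scores(["concise", "curt", "rambling"], comparisons)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Nobody ever tells the model what an ideal response looks like. The ranking emerges from relative judgments alone, which is why comparison data is enough to steer qualities such as tone or length.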