Preference-based evaluation asks evaluators to make a relative judgment (“which of these two responses is better?”) rather than an absolute one. This approach tends to be more reliable than rating scales, because it’s easier for people to agree on a winner than on an exact numeric score. Pairwise preferences are the data used to train reward models in RLHF, and they drive leaderboards like Chatbot Arena. The downside is that a comparison tells you which option is better relative to the other, but not how good either one is in absolute terms: a model can win every comparison and still be doing poorly by some objective standard. For behavior architects, preference-based evaluation is a practical, high-signal method for comparing variants, especially when absolute quality benchmarks are hard to define.
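
As a rough illustration of comparing two variants this way, the sketch below turns a list of pairwise judgments into a win rate with an approximate confidence interval. The function names (`win_rate`, `wilson_interval`), the "A"/"B"/"tie" labels, and the sample judgments are all made up for this example; dropping ties from the denominator is one possible convention, not the only one.

```python
from collections import Counter
from math import sqrt


def win_rate(judgments, variant="A"):
    """Share of decisive comparisons won by `variant` (ties are excluded)."""
    counts = Counter(judgments)
    other = "B" if variant == "A" else "A"
    decisive = counts[variant] + counts[other]
    if decisive == 0:
        raise ValueError("no decisive comparisons to score")
    return counts[variant] / decisive


def wilson_interval(wins, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion wins/n."""
    p = wins / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# Hypothetical judgments: each entry records which variant an evaluator preferred.
judgments = ["A", "B", "A", "A", "tie", "A", "B", "A", "A", "A"]

rate = win_rate(judgments)
wins = judgments.count("A")
decisive = wins + judgments.count("B")
low, high = wilson_interval(wins, decisive)

print(f"A wins {rate:.0%} of decisive comparisons (95% CI {low:.0%} to {high:.0%})")
```

With only a handful of comparisons the interval is wide, so a small lead may not mean much. And even a high win rate only says that A beats B; it says nothing about whether A is good in absolute terms.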