Elo Ranking | behavior.engineering

Elo ranking, originally developed for chess ratings, has been adapted for AI evaluation as a way to compare models or outputs without requiring an absolute score. In practice, two responses are shown side by side and a rater picks the winner; after each matchup the winner's rating rises and the loser's falls, so over many matchups each model's rating comes to reflect its overall competitive standing. Chatbot Arena, a widely used public benchmark, uses this approach to rank frontier models based on millions of user preference votes. For behavior architects, Elo-style ranking is useful because it sidesteps the difficulty of assigning absolute quality scores: people often find it much easier to say “this one is better” than to rate something on a scale from 1 to 5.
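To make the mechanics concrete, here is a minimal sketch of a classic Elo update in Python. It assumes the conventional chess defaults (a starting rating of 1000, a K-factor of 32, and a 400-point scale); these values, the function names, and the model names are illustrative assumptions, not a description of how any particular benchmark computes its leaderboard.

```python
def expected_score(rating_a, rating_b, scale=400.0):
    """Probability that A beats B under the Elo model (logistic curve, base 10)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


def update_ratings(rating_a, rating_b, score_a, k=32.0):
    """Return new (rating_a, rating_b) after one matchup.

    score_a is 1.0 if A won, 0.0 if B won, 0.5 for a tie.
    k (assumed 32 here) controls how far one result moves the ratings.
    """
    expected_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - expected_a)
    # Whatever A gains, B loses, so the total rating pool stays constant.
    return rating_a + delta, rating_b - delta


def rank_models(matchups, initial=1000.0, k=32.0):
    """Process (winner, loser) pairs in order and return a ratings dict."""
    ratings = {}
    for winner, loser in matchups:
        r_w = ratings.setdefault(winner, initial)
        r_l = ratings.setdefault(loser, initial)
        ratings[winner], ratings[loser] = update_ratings(r_w, r_l, 1.0, k)
    return ratings


# Hypothetical rater votes: each tuple is (preferred response, other response).
votes = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
]
ratings = rank_models(votes)
leaderboard = sorted(ratings, key=ratings.get, reverse=True)
print(leaderboard)
```

Because ratings are updated one vote at a time, the final numbers depend on the order in which matchups are processed, and an upset (a low-rated model beating a high-rated one) moves the ratings more than an expected result does.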