Glossary
Elo Ranking
A system for ranking models or outputs by the results of head-to-head comparisons, borrowed from competitive chess.
Elo ranking, originally developed for chess ratings, has been adapted for AI evaluation as a way to compare models or outputs without requiring an absolute score. In practice, two responses are shown side by side and a rater picks the winner; over many such matchups, each model accumulates a rating that reflects its overall competitive standing, with a win over a higher-rated opponent earning more than a win over a lower-rated one. Chatbot Arena, a widely used public benchmark, uses this approach to rank frontier models based on millions of user preference votes. For behavior architects, Elo-style ranking is useful because it sidesteps the difficulty of assigning absolute quality scores: people often find it much easier to say “this one is better” than to rate something on a scale from 1 to 5.
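As a rough illustration of how ratings accumulate from pairwise votes, the sketch below applies the standard Elo update to a list of (winner, loser) matchups. The starting rating of 1000, the K-factor of 32, and the function names are assumptions chosen for the example; they do not describe how any particular benchmark is configured.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, a_won, k=32.0):
    """Return the new (rating_a, rating_b) after one comparison.

    k (assumed 32 here) controls how far a single result moves the ratings.
    """
    expected_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - expected_a)
    # Whatever A gains, B loses, so the total rating is conserved.
    return rating_a + delta, rating_b - delta


def rank_models(matchups, initial=1000.0, k=32.0):
    """Process (winner, loser) pairs in order and return models sorted by rating.

    initial (assumed 1000 here) is the rating given to a model on first sight.
    """
    ratings = {}
    for winner, loser in matchups:
        rating_w = ratings.setdefault(winner, initial)
        rating_l = ratings.setdefault(loser, initial)
        ratings[winner], ratings[loser] = update_elo(rating_w, rating_l, True, k)
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


# Hypothetical votes: each tuple is (preferred model, other model).
votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]
for name, rating in rank_models(votes):
    print(f"{name}: {rating:.1f}")
```

Because each update depends on the current ratings, the result of this sequential scheme depends on the order in which votes are processed; with many votes the ordering effect tends to shrink, but it is one reason some leaderboards fit all comparisons jointly instead.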