Glossary
LLM-as-Judge
Using a language model to evaluate the quality of another model's outputs, often as a scalable alternative to human review.
LLM-as-Judge is an approach where a capable language model — often a frontier model like GPT-4 or Claude — is prompted to assess the quality, safety, or accuracy of another model’s outputs. It’s become popular because it scales cheaply, can be applied consistently, and can handle nuanced language better than rule-based systems. For example, you might prompt a judge model to rate whether a response is appropriately cautious or unnecessarily evasive. The main limitation is that LLM judges can share the biases and blind spots of the models that produced them — they tend to prefer longer, more confident-sounding responses and may miss subtle factual errors. Validating LLM-as-Judge outputs against human evaluation is essential before relying on them heavily.