LLM-as-Judge | behavior.engineering

LLM-as-Judge is an approach where a capable language model — often a frontier model like GPT-4 or Claude — is prompted to assess the quality, safety, or accuracy of another model’s outputs. It’s become popular because it scales cheaply, can be applied consistently, and can handle nuanced language better than rule-based systems. For example, you might prompt a judge model to rate whether a response is appropriately cautious or unnecessarily evasive. The main limitation is that LLM judges can share the biases and blind spots of the models that produced them — they tend to prefer longer, more confident-sounding responses and may miss subtle factual errors. Validating LLM-as-Judge outputs against human evaluation is essential before relying on them heavily.