Human evaluation involves real people — whether trained annotators, domain experts, or end users — reviewing and rating model outputs. It remains the gold standard for subjective qualities like tone, helpfulness, and appropriateness, because human judgment picks up on nuances that automated systems routinely miss. The tradeoff is cost and scale: human evaluation is slow and expensive compared to automated alternatives. For this reason, it’s often used selectively — to calibrate automated evaluators, to investigate specific failure modes, or to make high-stakes decisions about model updates. For behavior architects, human evaluation is irreplaceable for validating that your automated evaluations are actually tracking what you care about.
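One way to make the calibration step concrete is to have people label a small sample of outputs, run the automated evaluator on the same sample, and measure how often the two agree. The sketch below is illustrative only: the label values, the sample data, and the function name `cohens_kappa` are assumptions, and Cohen's kappa is just one common agreement statistic, not one this text prescribes. It reports raw agreement alongside kappa, which discounts the agreement two raters would reach by chance.

```python
from collections import Counter


def cohens_kappa(human, auto):
    """Chance-corrected agreement between two equal-length label lists."""
    if len(human) != len(auto) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)

    # Fraction of items on which the two raters gave the same label.
    observed = sum(h == a for h, a in zip(human, auto)) / n

    # Agreement expected by chance, from each rater's label frequencies.
    human_counts = Counter(human)
    auto_counts = Counter(auto)
    expected = sum(
        human_counts[label] * auto_counts[label] for label in human_counts
    ) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is
        # undefined here, so this sketch returns 1.0 by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical sample: human ratings vs. an automated evaluator's ratings
# for the same eight model outputs.
human_labels = ["good", "good", "bad", "good", "bad", "bad", "good", "bad"]
auto_labels = ["good", "good", "bad", "bad", "bad", "good", "good", "bad"]

agreement = sum(h == a for h, a in zip(human_labels, auto_labels)) / len(human_labels)
kappa = cohens_kappa(human_labels, auto_labels)

print(f"raw agreement: {agreement:.2f}")
print(f"Cohen's kappa: {kappa:.2f}")
```

A low kappa on a sample like this would suggest the automated evaluator is not tracking the human judgments it is meant to stand in for, which is a signal to revisit its rubric or prompt before relying on it at scale. What counts as "good enough" agreement depends on the task and is a judgment call rather than a fixed threshold.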