Inter-annotator agreement (IAA) tells you whether your annotation process is reliable: when two different people review the same model output, do they come to the same conclusion? High agreement suggests your guidelines are clear and your task is well-defined; low agreement suggests ambiguity in the task, the guidelines, or the data itself. Common measures include simple percent agreement, the fraction of items on which raters assign the same label, and Cohen’s kappa, which corrects that fraction for the agreement two raters would reach by chance. For behavior architects, low IAA is a signal that something needs to be fixed: the annotation guidelines are too vague, the raters need more calibration, or the behavioral category you’re trying to label is inherently difficult to define. Chasing high IAA forces the kind of precision in behavioral definitions that makes downstream training and evaluation more trustworthy.
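To make the two measures concrete, here is a minimal sketch in plain Python of percent agreement and Cohen’s kappa for two raters. The function names, the label names (`ok`, `refusal`), and the ten example annotations are all hypothetical, chosen only for illustration; a real pipeline would more likely rely on an established statistics library.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Fraction of items on which both raters chose the same label."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters: (p_o - p_e) / (1 - p_e)."""
    n = len(rater_a)
    p_o = percent_agreement(rater_a, rater_b)  # observed agreement

    # Expected chance agreement: for each label, the product of the two
    # raters' marginal probabilities of choosing it, summed over labels.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    p_e = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)

    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        return float("nan")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels from two raters on the same ten model outputs.
rater_a = ["ok", "ok", "ok", "ok", "ok", "ok", "ok", "refusal", "refusal", "refusal"]
rater_b = ["ok", "ok", "ok", "ok", "ok", "ok", "refusal", "refusal", "refusal", "ok"]

agreement = percent_agreement(rater_a, rater_b)
kappa = cohens_kappa(rater_a, rater_b)
print(f"percent agreement: {agreement:.2f}")  # prints 0.80
print(f"Cohen's kappa:     {kappa:.2f}")      # prints 0.52
```

In this made-up example the raters agree on 8 of 10 items, but because both label most outputs `ok`, a good share of that agreement would be expected by chance, and kappa comes out noticeably lower than the raw percentage. That gap is why percent agreement alone can overstate reliability when one label dominates.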