Glossary
Labeling
Assigning categories, tags, or classifications to data examples to indicate what they represent or how a model should treat them.
Labeling is a specific form of annotation in which the goal is to classify or tag data: for example, marking a response as “safe” or “unsafe,” tagging it with a topic category, or labeling a conversation turn as a refusal. Labels are the building blocks of training datasets and evaluation sets alike. The challenge is that even seemingly straightforward categories have edge cases where reasonable people disagree. For behavior architects, investing in label definitions (writing them precisely, testing them with real examples, and updating them as gaps appear) is often more valuable than finding more data to label.
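One way to test a label definition with real examples is to have two people label the same small batch independently and look at where they diverge. The sketch below is illustrative only: the function names `cohen_kappa` and `disagreements`, the sample responses, and the "safe"/"unsafe" labels assigned to them are all assumptions made for the example, not part of any particular tool. It computes Cohen's kappa (agreement corrected for chance) and lists the items the annotators labeled differently, which are usually the edge cases the definition needs to address.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label for everything.
        return 1.0
    return (observed - expected) / (1 - expected)


def disagreements(items, labels_a, labels_b):
    """Return (item, label_a, label_b) for every item labeled differently."""
    return [
        (item, a, b)
        for item, a, b in zip(items, labels_a, labels_b)
        if a != b
    ]


# Hypothetical batch: four responses labeled by two annotators.
items = ["resp-1", "resp-2", "resp-3", "resp-4"]
annotator_a = ["safe", "unsafe", "safe", "safe"]
annotator_b = ["safe", "unsafe", "unsafe", "safe"]

kappa = cohen_kappa(annotator_a, annotator_b)
contested = disagreements(items, annotator_a, annotator_b)
```

In this made-up batch the annotators agree on three of four items, and `contested` holds the one item they split on. A low kappa or a cluster of contested items around one category suggests the definition, rather than the annotators, needs revising.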