Categorical Thinking | behavior.engineering

Categorical thinking is the approach of defining fixed categories — “safe,” “unsafe,” “allowed,” “prohibited” — and applying them consistently, rather than reasoning about every situation from scratch. In AI safety and content policy, categorical approaches are valuable because they’re predictable and hard to manipulate: a model that never helps with certain request types, regardless of context, is more robust to adversarial arguments than one that weighs every case individually. The limitation is that the real world doesn’t map neatly onto categories — the same request can be benign or harmful depending on context that categories can’t always capture. For behavior architects, the art is knowing when categorical rules serve users and when contextual judgment is required, and designing systems that use each approach appropriately.
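One way to picture the division of labor described above is a two-layer decision function: a fixed categorical check runs first and ignores context entirely, and only requests that pass it reach a contextual judgment. The sketch below is illustrative only. The category names, the `context_risk` score, and the `risk_threshold` default are all hypothetical placeholders, not part of any real policy or library, and a production system would likely use far richer signals than a single number.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    REFUSE = "refuse"


# Hypothetical request types that are refused no matter what context is given.
CATEGORICAL_REFUSALS = frozenset({"weapon_synthesis", "credential_theft"})


@dataclass(frozen=True)
class Request:
    request_type: str    # hypothetical label, e.g. from an upstream classifier
    context_risk: float  # hypothetical contextual risk score in [0.0, 1.0]


def decide(request: Request, risk_threshold: float = 0.5) -> Decision:
    """Apply the categorical rule first, then fall back to contextual judgment."""
    # Categorical layer: context is deliberately ignored here, so no
    # argument supplied with the request can change the outcome.
    if request.request_type in CATEGORICAL_REFUSALS:
        return Decision.REFUSE

    # Contextual layer: the same request type may be allowed or refused
    # depending on the (assumed) risk score attached to this situation.
    if request.context_risk >= risk_threshold:
        return Decision.REFUSE
    return Decision.ALLOW
```

Under these assumptions, `decide(Request("weapon_synthesis", 0.0))` is refused even with a zero risk score, while a request type outside the fixed set, such as a hypothetical `"medication_dosage"`, is allowed or refused depending on its context. Putting the categorical check first is what keeps it robust: persuasive context never gets a chance to influence that branch.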