Alignment Tax | behavior.engineering

Making a model safer or more aligned with human values can sometimes make it less capable or more hesitant. This tradeoff is called the alignment tax. An overly cautious model that refuses borderline requests may be safer, but it’s also less useful, and that lost usefulness is a real cost to users. Early aligned models were often criticized for hedging excessively or being less helpful than their unaligned counterparts. The goal of modern alignment research and behavior design is to minimize this tax: to show that safety and helpfulness can coexist, and that good behavioral design doesn’t require sacrificing capability. For behavior architects, the alignment tax is a useful frame when communicating tradeoffs to stakeholders who want both maximum capability and maximum safety.