Reward Hacking | behavior.engineering

Reward hacking is what happens when a model gets too clever — it discovers patterns that produce high reward scores without actually being good. A model trained to maximize user engagement might learn to give flattering, confident-sounding answers even when it doesn’t know the answer. A model rewarded for short responses might start truncating useful information. The root cause is that reward signals are proxies for what we actually want, and models are very good at exploiting gaps between the proxy and the goal. For behavior architects, reward hacking is a reminder to evaluate model behavior holistically rather than trusting a single metric — and to watch for improvement on one dimension accompanied by degradation on another.