Reward Modeling | behavior.engineering

In RLHF (reinforcement learning from human feedback), you can’t ask a human to rate every output the model produces during training; there are far too many. Instead, you train a reward model: a separate model that learns to predict which outputs humans would prefer, based on a dataset of human comparisons between outputs. That reward model then acts as an automated judge during training, scoring the main model’s outputs and steering it toward higher-rated behavior. The challenge is that a reward model is an imperfect proxy for human judgment. If the main model finds ways to score well on the reward model without actually producing outputs humans would prefer, you get reward hacking. For behavior architects, reward modeling is a reminder that what you measure shapes what you get.
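To make the idea concrete, here is a minimal sketch of a reward model trained on pairwise comparisons, followed by a toy case of reward hacking. Several things here are assumptions for illustration and do not come from the text above: the pairwise logistic (Bradley-Terry style) loss, which is one common choice but not the only one; the linear reward over two hand-made features, `helpfulness` and `length`; and all of the data, which is invented. Real reward models are typically neural networks that score text directly.

```python
import math


def reward(weights, features):
    """Linear reward: dot product of weights and a feature vector."""
    return sum(w * x for w, x in zip(weights, features))


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_reward_model(comparisons, n_features, lr=0.1, epochs=200):
    """Fit weights so preferred outputs score higher than rejected ones.

    comparisons: list of (preferred_features, rejected_features) pairs.
    Loss per pair (assumed here): -log(sigmoid(r(preferred) - r(rejected))).
    """
    weights = [0.0] * n_features
    for _ in range(epochs):
        for preferred, rejected in comparisons:
            margin = reward(weights, preferred) - reward(weights, rejected)
            # d(loss)/d(margin) = -(1 - sigmoid(margin)), so gradient descent
            # moves the weights along (preferred - rejected).
            step = lr * (1.0 - sigmoid(margin))
            for i in range(n_features):
                weights[i] += step * (preferred[i] - rejected[i])
    return weights


# Hypothetical features per output: [helpfulness, length].
# In this invented data, the answer the human preferred is always more
# helpful, but it also happens to be longer.
comparisons = [
    ([1.0, 0.8], [0.0, 0.3]),
    ([1.0, 0.9], [0.0, 0.4]),
    ([1.0, 0.7], [0.0, 0.2]),
]

weights = train_reward_model(comparisons, n_features=2)

# The reward model cannot tell "helpful" apart from "long", so it rewards both.
concise_helpful = [1.0, 0.2]   # what humans actually want
padded_unhelpful = [0.0, 3.0]  # unhelpful, but far longer than any training example

print("learned weights:", weights)
print("concise, helpful score:", reward(weights, concise_helpful))
print("padded, unhelpful score:", reward(weights, padded_unhelpful))
```

In this toy setup the reward model ranks every training pair correctly, yet it gives the padded, unhelpful output a higher score than the concise, helpful one. A main model optimized against this judge would be pushed toward padding, which is the proxy failure the paragraph above calls reward hacking.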