Glossary
Reward Modeling
Training a separate model to predict human preferences so it can be used to score outputs during reinforcement learning.
In RLHF, you can’t ask a human to rate every output the model produces during training; there are far too many. Instead, you train a reward model: a separate system that learns to predict what humans would prefer, based on a set of human comparisons between outputs. That reward model then acts as an automated judge during training, scoring the main model’s outputs and steering it toward higher-rated behavior. The challenge is that reward models are imperfect proxies for human judgment. If the main model finds ways to score well on the reward model without actually producing better outputs, you get reward hacking. For behavior architects, reward modeling is a reminder that what you measure shapes what you get.
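To make "learns to predict what humans would prefer, based on a set of human comparisons" concrete, here is a minimal sketch in plain Python. It assumes a pairwise (Bradley-Terry style) logistic loss, which is one common way to fit a reward model to comparisons but is not stated in this entry. The linear model, the two-feature representation, and the toy data are all hypothetical and chosen only for illustration; real reward models are typically neural networks that score text directly.

```python
import math

def reward(weights, features):
    """Score an output as a weighted sum of its (hypothetical) features."""
    return sum(w * x for w, x in zip(weights, features))

def pairwise_loss(weights, preferred, rejected):
    """Bradley-Terry style loss: -log sigmoid(r(preferred) - r(rejected)).

    The loss is small when the preferred output scores higher than the
    rejected one, and large when the ordering is reversed.
    """
    margin = reward(weights, preferred) - reward(weights, rejected)
    return math.log1p(math.exp(-margin))

def train_reward_model(comparisons, n_features, lr=0.1, epochs=200):
    """Fit the weights to human comparisons with plain gradient descent."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for preferred, rejected in comparisons:
            margin = reward(weights, preferred) - reward(weights, rejected)
            # Derivative of the loss with respect to the margin.
            grad_scale = -1.0 / (1.0 + math.exp(margin))
            for i in range(n_features):
                weights[i] -= lr * grad_scale * (preferred[i] - rejected[i])
    return weights

# Assumed toy data: each output is described by two made-up features,
# and each pair is (output the human preferred, output the human rejected).
comparisons = [
    ([0.9, 0.2], [0.3, 0.8]),
    ([0.8, 0.5], [0.4, 0.5]),
    ([0.7, 0.1], [0.2, 0.9]),
]

weights = train_reward_model(comparisons, n_features=2)

# The trained model can now score outputs no human has rated.
score = reward(weights, [0.6, 0.3])
```

The sketch also hints at why reward hacking is possible: the model only knows what the comparisons taught it, so any output that pushes its features in the rewarded direction scores well, whether or not a human would actually like it.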