Policy Gradient | behavior.engineering

In reinforcement learning, the “policy” is the model’s strategy for choosing what to output, and policy gradient methods are techniques for nudging that strategy in the direction of higher rewards. Put simply: if a particular type of response earns a high reward, the model’s parameters are adjusted so that it produces that kind of response more often. Proximal Policy Optimization (PPO) is the policy gradient method most commonly used in RLHF (reinforcement learning from human feedback). For behavior architects who aren’t deeply technical, the key insight is that policy gradient methods are highly sensitive to the reward signal: small changes to how rewards are defined can have outsized effects on what the model learns to do, which is why reward hacking and alignment are persistent concerns.
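To make the "nudging" concrete, here is a minimal sketch of the simplest policy gradient update (REINFORCE with a running-average baseline) on a toy problem with three possible responses. It is not PPO and not how any production RLHF system is built. The function name `train_policy`, the reward values, the learning rate, and the baseline update rate are all assumptions chosen for illustration.

```python
import math
import random


def softmax(logits):
    """Turn raw preference scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_policy(rewards, steps=5000, lr=0.1, seed=0):
    """Hypothetical toy trainer: REINFORCE on a one-step task.

    rewards[i] is the fixed reward for choosing response i (an assumption;
    real reward signals are noisy and come from a learned reward model).
    """
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)  # the policy's parameters, initially uniform
    baseline = 0.0                 # running estimate of the average reward

    for _ in range(steps):
        probs = softmax(logits)
        # Sample a response from the current policy.
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        reward = rewards[action]
        # How much better (or worse) than usual this response was.
        advantage = reward - baseline

        # For a softmax policy, the gradient of log pi(action) with respect
        # to logit i is (1 if i == action else 0) - probs[i].
        for i in range(len(logits)):
            grad_log_prob = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * advantage * grad_log_prob

        # Move the baseline a little toward the observed reward.
        baseline += 0.05 * (reward - baseline)

    return softmax(logits)


# Response 2 is rewarded most, so the policy should come to prefer it.
probs = train_policy([0.0, 0.5, 1.0])
print([round(p, 3) for p in probs])
```

Calling `train_policy([1.0, 0.5, 0.0])` instead runs the identical loop but steers the policy toward response 0. Nothing in the update rule knows which response is "actually" good. It only follows the numbers it is given, which is the sensitivity to reward definitions described above.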