KL Divergence | behavior.engineering

KL divergence (short for Kullback-Leibler divergence) measures how much one probability distribution differs from another, and it shows up in RLHF as a guardrail. When fine-tuning a model, there's a risk that chasing a reward signal causes the model to drift far from its original behavior in ways that are undesirable or hard to predict. A KL penalty charges the model for moving its output distribution too far from that of its starting point, which keeps changes conservative and stable. Think of it like a leash on the training process: it allows improvement while preventing the model from reinventing itself too dramatically. You're unlikely to configure KL divergence directly as a behavior architect, but understanding it explains why trained models don't always shift behavior as dramatically as you might expect from the data you provide.
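To make the leash idea concrete, here is a minimal sketch in Python. It assumes a toy setting where the model chooses among three possible outputs, and the names `penalized_reward` and `beta` (the penalty strength) are illustrative rather than taken from any particular training library. Real RLHF pipelines compute the penalty over token sequences, but the arithmetic has the same shape: the reward is reduced in proportion to how far the tuned model has moved from the original.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities.

    Assumes every entry of q is non-zero wherever p is non-zero.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def penalized_reward(reward, policy_probs, reference_probs, beta=0.1):
    """Reward minus a KL penalty; beta is a hypothetical penalty strength."""
    return reward - beta * kl_divergence(policy_probs, reference_probs)

# The original (reference) model's preferences over three possible outputs.
reference = [0.5, 0.3, 0.2]

# Two candidate fine-tuned models: one stays close, one drifts far.
small_shift = [0.55, 0.28, 0.17]
large_shift = [0.98, 0.01, 0.01]

for name, policy in [("small shift", small_shift), ("large shift", large_shift)]:
    kl = kl_divergence(policy, reference)
    score = penalized_reward(1.0, policy, reference)
    print(f"{name}: KL = {kl:.4f}, penalized reward = {score:.4f}")
```

Both candidates earn the same raw reward of 1.0 here, yet the one that drifted further ends up with the lower penalized score. A larger `beta` shortens the leash, and a smaller one loosens it.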

Reinforcement Learning: A machine learning approach where a model learns by receiving rewards or penalties based on the quality of its actions.
RLHF (Reinforcement Learning from Human Feedback): A way of training AI models by having humans rate or compare outputs, then using those ratings to reinforce better behavior over time.
Policy Gradient: A family of reinforcement learning algorithms that improve a model's behavior by adjusting the probability of actions that lead to higher rewards.
Behavioral Drift: A gradual, unintended change in how a model behaves over time (often across model updates, prompt changes, or accumulating context) that wasn't explicitly planned.
Reward Hacking: When a model finds ways to score well on a reward signal without actually achieving the underlying goal the reward was meant to measure.