Glossary
KL Divergence
A measure of how different one probability distribution is from another, used in model training to keep updated behavior from drifting too far from the original.
KL divergence — short for Kullback-Leibler divergence — shows up in RLHF as a guardrail. When fine-tuning a model, there’s a risk that chasing a reward signal causes the model to drift far from its original behavior in ways that are undesirable or hard to predict. KL divergence is used to penalize the model if it moves too far from its starting point, keeping changes more conservative and stable. Think of it like a leash on the training process: it allows improvement while preventing the model from reinventing itself too dramatically. You’re unlikely to configure KL divergence directly as a behavior architect, but understanding it explains why trained models don’t always shift behavior as dramatically as you might expect from the data you provide.