Glossary Entry

KL Divergence

A measure of how one probability distribution differs from another; asymmetric, so the direction you compute it in changes what an optimizer learns.

Optimization Models

Also called: Kullback-Leibler divergence, KL divergence, forward KL, reverse KL

Seed source: Wikipedia

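The Kullback-Leibler divergence KL(P || Q) measures how much information is lost when a distribution Q is used to approximate a distribution P: it is the expected log-ratio of P to Q, with the expectation taken under P. It is never negative, and it is zero only when the two distributions match. It is not symmetric, so it is not a true distance: which distribution you put first changes the answer, and changes what a model trained to minimize it will do.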
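As a rough illustration of the asymmetry, the sketch below computes the divergence in both directions for two small discrete distributions. The function name `kl_divergence` and the example probabilities `p` and `q` are made up for illustration and are not taken from any particular library.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) = sum_i p[i] * log(p[i] / q[i]), in nats, for discrete distributions.

    Terms with p[i] == 0 contribute nothing; if q[i] == 0 while p[i] > 0
    the divergence is infinite.
    """
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue
        if q_i == 0.0:
            return math.inf
        total += p_i * math.log(p_i / q_i)
    return total

# Two illustrative distributions over the same three outcomes (assumed values).
p = [0.5, 0.4, 0.1]
q = [0.8, 0.1, 0.1]

forward = kl_divergence(p, q)  # p first
reverse = kl_divergence(q, p)  # q first

print(f"KL(p || q) = {forward:.3f} nats")
print(f"KL(q || p) = {reverse:.3f} nats")
print(f"KL(p || p) = {kl_divergence(p, p):.3f} nats")
```

With these values the two directions come out at roughly 0.32 and 0.24 nats, while the divergence of `p` from itself is zero.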

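That asymmetry matters in practice. Minimizing forward KL, KL(target || model), produces mode-covering behaviour: the model is penalized heavily wherever the target has mass and the model has almost none, so it spreads probability over every region the target covers, even if that means placing mass between modes. Minimizing reverse KL, KL(model || target), produces mode-seeking behaviour: the model is penalized for placing mass where the target has little, so it commits to the parts of the target it can represent well and may ignore the rest. Knowledge distillation, variational inference, and RLHF regularization all hinge on this choice of direction.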
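The difference between the two directions can be seen in a toy experiment. The sketch below fits a single Gaussian to a two-peaked target by brute-force search over a small grid of means and standard deviations, once per direction. Everything here is an assumption chosen for illustration: the target (equal peaks at -3 and +3), the discretization grid, the candidate grid, and the helper names `normal_pdf`, `normalize`, `kl`, and `model`. Real systems use gradient-based training rather than grid search.

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def kl(a, b):
    """KL(a || b) in nats for discrete distributions on the same grid."""
    total = 0.0
    for a_i, b_i in zip(a, b):
        if a_i == 0.0:
            continue
        if b_i == 0.0:
            return math.inf
        total += a_i * math.log(a_i / b_i)
    return total

# Discretize the real line on [-10, 10] so both KLs are plain sums.
xs = [-10.0 + 0.1 * i for i in range(201)]

# Illustrative bimodal target: two equal peaks at -3 and +3.
target = normalize(
    [0.5 * normal_pdf(x, -3.0, 0.7) + 0.5 * normal_pdf(x, 3.0, 0.7) for x in xs]
)

def model(mu, sigma):
    """A single Gaussian, discretized on the same grid as the target."""
    return normalize([normal_pdf(x, mu, sigma) for x in xs])

# Candidate (mu, sigma) pairs: mu in [-4, 4], sigma in [0.5, 4.0].
candidates = [(-4.0 + 0.5 * i, 0.5 + 0.25 * j) for i in range(17) for j in range(15)]

# Forward KL puts the target first; reverse KL puts the model first.
forward_fit = min(candidates, key=lambda c: kl(target, model(*c)))
reverse_fit = min(candidates, key=lambda c: kl(model(*c), target))

print("forward KL fit (mu, sigma):", forward_fit)
print("reverse KL fit (mu, sigma):", reverse_fit)
```

Under these assumptions the forward fit should land on a wide Gaussian centred between the two peaks (mode-covering), while the reverse fit should land on a narrow Gaussian sitting on one peak (mode-seeking). Which peak it picks is arbitrary, since the target is symmetric.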