The Kullback-Leibler divergence D_KL(P || Q) measures how poorly a distribution Q approximates a distribution P: it is the expected value, under P, of the log-ratio log(P/Q). It is never negative, it is zero only when the two distributions match, and it is not symmetric: which distribution you put first changes the answer, and changes what a model trained to minimize it will do.
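To make the asymmetry concrete, here is a minimal sketch for discrete distributions, using the standard definition D_KL(P || Q) = sum over x of P(x) log(P(x) / Q(x)). The function name `kl_divergence` and the two example distributions `p` and `q` are illustrative assumptions, not taken from any particular library.

```python
import math


def kl_divergence(p, q):
    """Discrete KL divergence D_KL(p || q) in nats.

    Terms with p[i] == 0 contribute nothing. If q[i] == 0 somewhere
    that p[i] > 0, the divergence is infinite.
    """
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue
        if qi == 0.0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total


# Two hypothetical distributions over three outcomes.
p = [0.5, 0.4, 0.1]
q = [0.8, 0.1, 0.1]

forward = kl_divergence(p, q)  # p first
reverse = kl_divergence(q, p)  # q first
same = kl_divergence(p, p)     # identical distributions

print(f"D_KL(p || q) = {forward:.4f}")
print(f"D_KL(q || p) = {reverse:.4f}")
print(f"D_KL(p || p) = {same:.4f}")
```

For these particular numbers the two directions come out at roughly 0.32 and 0.24 nats, while the divergence of `p` from itself is zero.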
That asymmetry matters in practice. Minimizing forward KL (target first) produces mode-covering behaviour: the model is penalized heavily wherever the target has mass and the model does not, so it spreads probability over every region the target covers, including any low-density gaps between modes. Minimizing reverse KL (model first) produces mode-seeking behaviour: the model is penalized for placing mass where the target has little, so it commits to the parts of the target it can represent well and may ignore the rest. Knowledge distillation, variational inference, and RLHF regularization all hinge on this choice of direction.
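The difference between the two directions can be seen in a small numerical experiment. The sketch below fits a single Gaussian to a bimodal target by brute-force grid search, once under each direction. Everything specific here is an assumption chosen for illustration: the target (an equal mixture of N(-3, 1) and N(3, 1)), the integration grid, the search ranges, and the helper names `log_normal_pdf`, `forward_kl`, `reverse_kl`, and `grid_search`. It is not how any of the methods named above are implemented in practice.

```python
import numpy as np

# Integration grid for approximating the KL integrals with a Riemann sum.
x = np.linspace(-10.0, 10.0, 2001)
dx = x[1] - x[0]


def log_normal_pdf(x, mu, sigma):
    """Log-density of a Gaussian with mean mu and standard deviation sigma."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2.0 * np.pi)


# Hypothetical bimodal target: equal mixture of N(-3, 1) and N(3, 1).
log_p = np.logaddexp(
    log_normal_pdf(x, -3.0, 1.0), log_normal_pdf(x, 3.0, 1.0)
) + np.log(0.5)
p = np.exp(log_p)


def forward_kl(mu, sigma):
    """D_KL(target || model): expectation taken under the target."""
    log_q = log_normal_pdf(x, mu, sigma)
    return np.sum(p * (log_p - log_q)) * dx


def reverse_kl(mu, sigma):
    """D_KL(model || target): expectation taken under the model."""
    log_q = log_normal_pdf(x, mu, sigma)
    q = np.exp(log_q)
    return np.sum(q * (log_q - log_p)) * dx


def grid_search(objective):
    """Return the (mu, sigma) pair on a coarse grid that minimizes objective."""
    best = None
    for mu in np.linspace(-4.0, 4.0, 81):
        for sigma in np.linspace(0.5, 4.0, 36):
            value = objective(mu, sigma)
            if best is None or value < best[0]:
                best = (value, mu, sigma)
    return best[1], best[2]


mu_fwd, sigma_fwd = grid_search(forward_kl)
mu_rev, sigma_rev = grid_search(reverse_kl)

print(f"forward KL fit: mu={mu_fwd:.2f}, sigma={sigma_fwd:.2f}")
print(f"reverse KL fit: mu={mu_rev:.2f}, sigma={sigma_rev:.2f}")
```

Under these assumptions the forward fit should settle near a mean of 0 with a standard deviation of about 3, a broad Gaussian that covers both modes and the gap between them. The reverse fit should settle on one of the two modes, with a mean near 3 or -3 and a standard deviation close to 1. Which mode it picks is arbitrary here, since the target is symmetric.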
