KL Divergence
Fundamentals
A measure of how one probability distribution differs from a reference distribution, quantifying the information lost when approximating one distribution with another.
Kullback-Leibler (KL) divergence D_KL(p || q) = sum(p(x) * log(p(x)/q(x))) measures the extra bits needed to encode data from distribution p using a code optimized for distribution q. It is always non-negative and equals zero only when the two distributions are identical. Importantly, it is not symmetric: D_KL(p || q) is not equal to D_KL(q || p).
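A minimal sketch of the definition above for discrete distributions, using base-2 logarithms so the result is in bits (the function name and example distributions are illustrative, not from the source):

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in bits for discrete distributions given as probability lists.
    Terms with p(x) = 0 contribute nothing, by the convention 0 * log 0 = 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]

print(kl_divergence(p, q))  # positive: extra bits to encode p with a code built for q
print(kl_divergence(q, p))  # a different value: KL divergence is not symmetric
print(kl_divergence(p, p))  # 0.0: identical distributions
```

Running both directions makes the asymmetry concrete: encoding a uniform source with a code tuned to a skewed distribution costs a different number of extra bits than the reverse.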
KL divergence connects to cross-entropy through the identity: H(p, q) = H(p) + D_KL(p || q). Since entropy H(p) of the true distribution is constant during training, minimizing cross-entropy loss is equivalent to minimizing KL divergence -- making the model's predictions match reality. This gives a principled information-theoretic justification for why cross-entropy is the standard classification loss.
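The identity H(p, q) = H(p) + D_KL(p || q) can be checked numerically; this sketch uses small hand-picked distributions (the helper names are illustrative):

```python
import math

def entropy(p):
    """H(p) = -sum p(x) log p(x), in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum p(x) log q(x), in bits."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]  # "true" label distribution
q = [0.5, 0.3, 0.2]  # model's predicted distribution

# H(p, q) = H(p) + D_KL(p || q), up to floating-point error
assert abs(cross_entropy(p, q) - (entropy(p) + kl(p, q))) < 1e-12
```

Since H(p) does not depend on the model, any optimizer lowering cross_entropy(p, q) is lowering kl(p, q) by exactly the same amount.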
KL divergence appears throughout machine learning: as a regularization term in Variational Autoencoders (encouraging the latent distribution to match a prior), in policy optimization for reinforcement learning (constraining how much a policy changes between updates), in knowledge distillation (measuring how well a student model mimics a teacher), and in drift detection (measuring how much input distributions have shifted from the training distribution).
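For the VAE case above, the divergence between a diagonal Gaussian posterior and a standard normal prior has a well-known closed form, D_KL(N(mu, sigma^2) || N(0, 1)) = 0.5 * sum(mu^2 + sigma^2 - 1 - log sigma^2); the source does not give this formula, so the sketch below is an illustrative aside (in nats, with an assumed log-variance parameterization):

```python
import math

def gaussian_kl_to_standard_normal(mu, log_var):
    """Closed-form D_KL(N(mu, sigma^2) || N(0, I)) in nats, summed over dimensions.
    mu and log_var are per-dimension mean and log-variance of the latent code."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))

# A latent code that matches the prior exactly incurs zero penalty...
print(gaussian_kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))  # 0.0
# ...and the regularization term grows as the encoder drifts from the prior.
print(gaussian_kl_to_standard_normal([1.0, -0.5], [0.2, -0.3]))
```

Adding this term to the reconstruction loss is what pulls the latent distribution toward the prior during VAE training.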
Last updated: February 22, 2026