Cross-Entropy
Optimization
A loss function that measures the difference between a model's predicted probability distribution and the true distribution, widely used for classification tasks.
Cross-entropy H(p, q) = -sum(p(x) * log(q(x))) measures the average number of bits needed to encode data from the true distribution p using a code optimized for the predicted distribution q. When the model's predictions perfectly match reality, cross-entropy equals entropy (the theoretical minimum). Any deviation increases cross-entropy.
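A minimal sketch of the definition above, using NumPy and natural logarithms (so the result is in nats rather than bits). The distributions `p` and `q` are hypothetical example values; the point is that cross-entropy bottoms out at the entropy of `p` when the prediction matches the truth exactly:

```python
import numpy as np

def cross_entropy(p, q):
    """H(p, q) = -sum(p(x) * log(q(x))), in nats."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p * np.log(q))

p = np.array([0.7, 0.2, 0.1])  # true distribution (example values)
q = np.array([0.6, 0.3, 0.1])  # model's predicted distribution

h_pq = cross_entropy(p, q)  # cross-entropy of q under p
h_pp = cross_entropy(p, p)  # entropy of p: the theoretical minimum

# Any deviation of q from p increases cross-entropy above the entropy of p.
assert h_pq > h_pp
```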
For classification, cross-entropy loss is the standard choice because of several deep mathematical properties. It is equivalent to negative log-likelihood (maximizing the probability the model assigns to correct labels), it penalizes confident wrong predictions exponentially more than uncertain ones (-log(0.01) = 4.6 vs -log(0.5) = 0.69), and when combined with softmax, it produces elegantly simple gradients (prediction minus truth). Binary cross-entropy is the special case for two-class problems.
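The "prediction minus truth" gradient can be verified numerically. The sketch below, with hypothetical logits for a 3-class example, computes the analytic gradient of softmax cross-entropy with respect to the logits and checks it against central finite differences:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])  # logits (example values)
y = np.array([0.0, 1.0, 0.0])  # one-hot true label

q = softmax(z)
loss = -np.sum(y * np.log(q))  # cross-entropy = negative log-likelihood

# Analytic gradient of softmax cross-entropy w.r.t. the logits:
# simply prediction minus truth.
grad_analytic = q - y

# Numerical check via central finite differences.
eps = 1e-6
grad_numeric = np.zeros_like(z)
for i in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[i] += eps
    zm[i] -= eps
    lp = -np.sum(y * np.log(softmax(zp)))
    lm = -np.sum(y * np.log(softmax(zm)))
    grad_numeric[i] = (lp - lm) / (2 * eps)

assert np.allclose(grad_analytic, grad_numeric, atol=1e-5)
```

The same identity is why deep-learning frameworks fuse softmax and cross-entropy into a single op: the combined backward pass is one subtraction.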
Cross-entropy connects information theory to machine learning: minimizing cross-entropy is equivalent to minimizing KL divergence between true and predicted distributions (since the entropy of the true distribution is constant). This gives a principled reason why cross-entropy is the right loss for classification -- it directly measures how well the model's probability estimates match the true data distribution.
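The decomposition behind this equivalence is H(p, q) = H(p) + KL(p || q). A quick numerical check of the identity, with hypothetical distributions:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])  # true distribution (example values)
q = np.array([0.4, 0.4, 0.2])  # predicted distribution

entropy = -np.sum(p * np.log(p))        # H(p), constant w.r.t. the model
cross_entropy = -np.sum(p * np.log(q))  # H(p, q)
kl = np.sum(p * np.log(p / q))          # KL(p || q)

# H(p, q) = H(p) + KL(p || q): since H(p) does not depend on q,
# minimizing cross-entropy over q is the same as minimizing KL divergence.
assert np.isclose(cross_entropy, entropy + kl)
```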
Last updated: February 22, 2026