>_TheQuery

Softmax

Deep Learning

A function that converts a vector of real numbers into a probability distribution, where each output is between 0 and 1 and all outputs sum to 1.

The softmax function takes a vector of arbitrary real-valued scores (logits) and transforms them into probabilities: softmax(x_i) = exp(x_i) / sum(exp(x_j)). It is used as the final layer in classification networks to produce class probabilities and as a normalization step in the attention mechanism of transformers.
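The definition above can be sketched directly in NumPy. One practical detail not visible in the formula: implementations subtract the maximum logit before exponentiating so that exp does not overflow, which is safe because softmax is invariant to adding a constant to all inputs.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability; this does not change
    # the result, since softmax is invariant to additive shifts.
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

logits = np.array([5.0, 4.0, 0.0])
probs = softmax(logits)
print(probs)        # each entry lies in (0, 1)
print(probs.sum())  # the entries sum to 1
```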

Softmax amplifies differences through exponentiation: larger values receive disproportionately more probability mass. For example, logits [5, 4, 0] become approximately [0.727, 0.268, 0.005], so the highest score dominates. This sharpening is controlled by a temperature parameter: dividing the logits by a temperature tau before applying softmax adjusts how peaked the output is. Low temperature (tau approaching 0) makes the output nearly one-hot, while high temperature (tau approaching infinity) makes it nearly uniform.
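The effect of temperature can be seen by sweeping tau on the same logits. A minimal sketch (the helper name `softmax_with_temperature` is illustrative, not a standard API):

```python
import numpy as np

def softmax_with_temperature(logits, tau):
    # Divide logits by the temperature tau, then apply a
    # numerically stable softmax.
    z = logits / tau
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

logits = np.array([5.0, 4.0, 0.0])
print(softmax_with_temperature(logits, 0.1))    # nearly one-hot
print(softmax_with_temperature(logits, 1.0))    # standard softmax
print(softmax_with_temperature(logits, 100.0))  # nearly uniform
```

With tau = 0.1 the top entry absorbs almost all the probability mass; with tau = 100 the three entries are nearly equal at about 1/3 each.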

In transformers, softmax converts attention scores into attention weights (probability distributions over which tokens to attend to). The scaling factor 1/sqrt(d_k) in scaled dot-product attention prevents logits from growing too large with high dimensionality, which would cause softmax to saturate and produce near-zero gradients. Understanding softmax is essential for working with both classification models and modern transformer architectures.
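The attention use case above can be sketched as a small NumPy function: scores are query-key dot products scaled by 1/sqrt(d_k), a row-wise softmax turns them into attention weights, and the weights mix the value vectors. The shapes and random inputs here are illustrative, not from any particular model.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    # Scale by 1/sqrt(d_k) so the scores do not grow with the key
    # dimension, which would saturate the softmax.
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise stable softmax: each query gets a probability
    # distribution over the keys (its attention weights).
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    weights = e / e.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))  # 3 query tokens, d_k = 8
K = rng.normal(size=(3, 8))  # 3 key tokens
V = rng.normal(size=(3, 8))  # 3 value vectors
out, w = scaled_dot_product_attention(Q, K, V)
print(w.sum(axis=-1))  # each row of attention weights sums to 1
```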

Last updated: February 22, 2026