
Adam Optimizer

Optimization

An adaptive learning rate optimization algorithm that maintains per-parameter learning rates based on first and second moment estimates of gradients.

Adam (Adaptive Moment Estimation) is one of the most widely used optimizers in deep learning, combining ideas from momentum (tracking exponential moving averages of gradients) and RMSprop (tracking exponential moving averages of squared gradients). For each parameter, Adam adapts the learning rate based on the history of that parameter's gradients.

Adam maintains two moving averages per parameter: the first moment (mean of gradients, providing momentum) and the second moment (mean of squared gradients, providing per-parameter scaling). Because both averages are initialized at zero, they are bias-corrected early in training before being used in the update. Parameters with consistently large gradients get smaller effective learning rates, while parameters with small or sparse gradients get larger ones. This is particularly useful for problems with sparse gradients or when different parameters operate at different scales.
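The two moving averages and the resulting update can be sketched as follows. This is a minimal NumPy implementation of a single Adam step for one parameter array; the function name and signature are illustrative, not from any particular library:

```python
import numpy as np

def adam_step(param, grad, m, v, t,
              lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter array.

    m, v: running first- and second-moment estimates (same shape as param).
    t:    1-based step count, used for bias correction.
    """
    m = beta1 * m + (1 - beta1) * grad       # first moment: EMA of gradients
    v = beta2 * v + (1 - beta2) * grad**2    # second moment: EMA of squared gradients
    m_hat = m / (1 - beta1**t)               # bias correction (moments start at zero)
    v_hat = v / (1 - beta2**t)
    # Per-parameter scaling: large v_hat shrinks the effective step size
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
```

For example, repeatedly applying `adam_step` to a parameter with gradient `2 * p` (i.e., minimizing `p**2`) moves `p` toward zero, with each step's magnitude close to `lr` once the moment estimates stabilize.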

In practice, Adam often converges faster than vanilla SGD and requires less learning rate tuning, making it the default choice for many practitioners. The commonly used defaults are a learning rate of 0.001, beta1 = 0.9, beta2 = 0.999, and epsilon = 1e-8. However, Adam can sometimes generalize worse than well-tuned SGD with momentum, particularly in computer vision. Variants such as AdamW (which decouples weight decay from the adaptive gradient update) and LAMB (designed for large-batch training) address some of Adam's limitations.

Last updated: February 22, 2026