
Gradient Descent

Fundamentals

An optimization algorithm that iteratively adjusts model parameters in the direction of steepest decrease of the loss function.

Gradient descent is the foundational optimization algorithm used to train neural networks and many other machine learning models. It works by computing the gradient (the vector of partial derivatives) of the loss function with respect to each model parameter, then updating the parameters in the opposite direction of the gradient, scaled by a learning rate, to reduce the loss.
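The update rule described above can be sketched on a toy problem. The loss function, learning rate, and iteration count below are illustrative assumptions, not part of any particular library:

```python
# Minimal gradient descent on a toy quadratic loss L(w) = (w - 3)^2,
# which has its minimum at w = 3.

def grad(w):
    # Analytic gradient of (w - 3)^2 with respect to w.
    return 2.0 * (w - 3.0)

w = 0.0    # initial parameter value
lr = 0.1   # learning rate (step size)

for _ in range(100):
    w -= lr * grad(w)  # step in the opposite direction of the gradient

print(round(w, 4))  # converges toward the minimum at w = 3.0
```

Each step moves `w` against the gradient; with a suitable learning rate the iterates approach the minimizer, while too large a step size would overshoot and diverge.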

There are several variants of gradient descent. Batch gradient descent computes the gradient over the entire training dataset, which is accurate but computationally expensive. Stochastic gradient descent (SGD) updates parameters using a single example at a time, introducing noise but enabling faster iterations. Mini-batch gradient descent strikes a balance by computing gradients over small batches of data and is the most commonly used approach in practice.
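The mini-batch variant can be sketched with a one-parameter linear model. The synthetic data, batch size, and learning rate here are all illustrative choices:

```python
import random

# Mini-batch gradient descent for a 1-D linear model y ≈ w * x
# under squared loss, on synthetic data following y = 2x.

random.seed(0)
data = [(x, 2.0 * x) for x in range(1, 21)]

w = 0.0
lr = 0.001
batch_size = 4

for epoch in range(100):
    random.shuffle(data)  # stochasticity: new batch composition each epoch
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        # Average gradient of 0.5 * (w*x - y)^2 over the mini-batch.
        g = sum((w * x - y) * x for x, y in batch) / len(batch)
        w -= lr * g

print(round(w, 3))  # approaches the true slope 2.0
```

Setting `batch_size = len(data)` would recover batch gradient descent, and `batch_size = 1` would recover SGD; intermediate sizes trade gradient accuracy against update frequency, which is why mini-batches dominate in practice.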

Modern deep learning typically employs adaptive gradient methods such as Adam, RMSProp, and AdaGrad, which maintain per-parameter learning rates that adapt based on the history of gradients. The choice of optimizer, learning rate, and learning rate schedule significantly impacts training speed and the quality of the final model.
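As one example of an adaptive method, the Adam update maintains exponential moving averages of the gradient and its square. This is a minimal sketch on the same kind of toy quadratic loss, using the commonly cited default hyperparameters; it is not any library's implementation:

```python
import math

# Sketch of the Adam update on the toy loss L(w) = (w - 3)^2.

def grad(w):
    return 2.0 * (w - 3.0)

w = 0.0
lr, beta1, beta2, eps = 0.05, 0.9, 0.999, 1e-8
m = v = 0.0  # first- and second-moment estimates

for t in range(1, 501):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g       # moving average of gradients
    v = beta2 * v + (1 - beta2) * g * g   # moving average of squared gradients
    m_hat = m / (1 - beta1 ** t)          # bias correction for zero init
    v_hat = v / (1 - beta2 ** t)
    # Per-parameter step scaled by the gradient's recent magnitude.
    w -= lr * m_hat / (math.sqrt(v_hat) + eps)
```

Dividing by the root of the second-moment estimate gives each parameter its own effective step size, which is what makes these methods less sensitive to a single global learning rate than plain SGD.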

Last updated: February 20, 2026