>_TheQuery

Gradient Clipping

Optimization

A technique that caps gradient magnitudes during training to prevent exploding gradients from destabilizing the optimization process.

Gradient clipping limits the size of gradients during backpropagation by rescaling them when their norm exceeds a specified threshold. The most common approach, gradient norm clipping, works as follows: compute the total L2 norm of the gradients across all parameters, and if it exceeds the threshold, scale all gradients down proportionally so the total norm equals the threshold. This preserves the gradient's direction while limiting its magnitude.
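The rescaling step can be sketched in a few lines of plain Python. This is a minimal illustration (the function name `clip_grad_norm` and the flat list of gradient values are assumptions for the example, not any particular library's API):

```python
import math

def clip_grad_norm(grads, max_norm):
    # Total L2 norm over all gradient values.
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Rescale only when the norm exceeds the threshold;
    # scaling every component by the same factor preserves direction.
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm

# A gradient vector [3, 4] has norm 5; with max_norm=1.0 it is
# rescaled to [0.6, 0.8], whose norm is exactly 1.0.
clipped, norm = clip_grad_norm([3.0, 4.0], max_norm=1.0)
```

Gradients whose norm is already below the threshold pass through unchanged, which is why clipping has no effect on well-behaved training steps.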

The technique directly addresses the exploding gradient problem that occurs in deep networks and recurrent architectures. Without clipping, a single batch with unusually large gradients can destroy hours of training progress by making enormous weight updates. Gradient clipping acts as a safety mechanism: normal gradient steps proceed unchanged, but catastrophically large gradients are reined in before they can cause damage.

In practice, gradient clipping is nearly universal in training recurrent networks and transformers. Common threshold values range from 0.5 to 5.0, with 1.0 being a frequent default. It is typically applied after computing gradients but before the optimizer step. While gradient clipping does not solve the root cause of unstable gradients (architecture or initialization issues), it provides robust protection against training divergence and is a standard component of any deep learning training pipeline.
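The ordering described above (compute gradients, clip, then step) can be sketched with a plain SGD update. The function name `sgd_step_with_clipping` and the toy gradient values are illustrative assumptions, not a real framework's API:

```python
import math

def sgd_step_with_clipping(params, grads, lr=0.1, max_norm=1.0):
    # Clipping is applied after gradients are computed (backprop)
    # but before the parameter update, matching standard practice.
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / total_norm) for g in grads]
    # Plain SGD update on the (possibly clipped) gradients.
    return [p - lr * g for p, g in zip(params, grads)]

# A pathological batch: gradients of norm 5000 that would otherwise
# move the parameters by hundreds of units in a single step.
params = sgd_step_with_clipping([1.0, 1.0], [3000.0, 4000.0])
```

In PyTorch, for example, the same placement corresponds to calling `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)` between `loss.backward()` and `optimizer.step()`.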

Last updated: February 22, 2026