>_TheQuery

Weight Initialization

Deep Learning

The strategy for setting initial values of neural network parameters before training begins, critical for ensuring stable signal and gradient propagation through deep networks.

Weight initialization determines the starting point of optimization and can mean the difference between a network that trains in 10 epochs and one that never trains at all. Initializing all weights to zero (or any single constant) creates a symmetry problem: every neuron in a layer computes the same function and receives the same gradient, so the neurons can never differentiate and learning stalls. Weights that are too large cause activations and gradients to explode; weights that are too small cause them to vanish.
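The symmetry problem can be seen directly in a toy two-layer network. This is a minimal sketch (the layer sizes, the constant 0.1, and the squared-error target are illustrative assumptions, not from the text): with every weight set to the same constant, one forward/backward pass produces a gradient whose columns -- one per hidden neuron -- are all identical, so gradient descent can never make the neurons differ.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))      # batch of 4 inputs, 8 features (illustrative shapes)

# Constant initialization: every hidden neuron starts out identical
W1 = np.full((8, 16), 0.1)
W2 = np.full((16, 1), 0.1)

# Forward pass with a tanh hidden layer
h = np.tanh(x @ W1)                  # all 16 hidden columns are identical
y = h @ W2

# Backward pass for a squared-error loss against an arbitrary target
t = np.ones((4, 1))
dy = y - t
dh = dy @ W2.T                       # identical gradient signal to every neuron
dW1 = x.T @ (dh * (1 - h**2))        # tanh'(z) = 1 - tanh(z)^2

# Every column (neuron) of dW1 equals every other column: symmetry never breaks
assert np.allclose(dW1, dW1[:, :1])
assert np.abs(dW1).sum() > 0         # the gradient is nonzero, yet uninformative
```

Random initialization breaks this symmetry, which is why even a naive small-random-Gaussian scheme beats any constant one.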

The key principle is variance preservation: initialize weights so that the variance of activations remains approximately constant across layers during forward propagation, and gradient variance remains constant during backpropagation. Xavier (Glorot) initialization sets Var(w) = 2/(n_in + n_out), balancing forward and backward signal flow, and works well with tanh and sigmoid activations. He initialization sets Var(w) = 2/n_in, compensating for the fact that ReLU zeros out roughly half of activations.
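The effect of these variance choices can be sketched empirically. In this illustrative experiment (the width 256, depth 20, and batch size are assumptions chosen for the demo), a batch is pushed through a stack of ReLU layers: He initialization keeps the activation standard deviation near its starting value, while Xavier under-scales for ReLU and a fixed small scale collapses the signal.

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in = fan_out = 256
depth = 20

def forward_std(init_std):
    """Propagate a batch through `depth` ReLU layers; return final activation std."""
    x = rng.standard_normal((512, fan_in))
    for _ in range(depth):
        W = rng.standard_normal((fan_in, fan_out)) * init_std
        x = np.maximum(0.0, x @ W)                  # ReLU zeros ~half the units
    return x.std()

he = np.sqrt(2.0 / fan_in)                          # He:     Var(w) = 2 / n_in
xavier = np.sqrt(2.0 / (fan_in + fan_out))          # Xavier: Var(w) = 2 / (n_in + n_out)

print("He:    ", forward_std(he))                   # stays near 1
print("Xavier:", forward_std(xavier))               # shrinks ~0.5x in variance per layer
print("0.01:  ", forward_std(0.01))                 # vanishes almost immediately
```

With ReLU, each layer halves the activation variance on average, which is exactly what the extra factor of 2 in He initialization compensates for; Xavier's 2/(n_in + n_out) leaves a residual 0.5x per-layer shrinkage here.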

Proper initialization makes the network an approximate isometry -- a transformation that preserves distances -- so information flows forward and gradients flow backward without amplification or attenuation. Before Xavier and He initialization (pre-2010), training networks deeper than 5 layers was nearly impossible. These initialization methods, along with batch normalization and residual connections, were key breakthroughs enabling modern deep learning.

Last updated: February 22, 2026