Sigmoid
Deep Learning

An S-shaped activation function that maps any real number to a value between 0 and 1, historically important but largely replaced by ReLU in hidden layers.
The sigmoid function sigma(x) = 1/(1 + e^(-x)) squashes any input into the range (0, 1), making it useful for producing probability-like outputs. It was one of the earliest activation functions used in neural networks and is still used in the output layer for binary classification and in gating mechanisms (such as the gates inside LSTMs).
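The definition above can be sketched directly; a naive `1/(1 + exp(-x))` overflows for large negative inputs, so the sketch below (function names are illustrative, not from any particular library) branches on the sign of x for numerical stability:

```python
import math

def sigmoid(x: float) -> float:
    """Numerically stable sigmoid: sigma(x) = 1 / (1 + e^(-x))."""
    if x >= 0:
        # e^(-x) <= 1 here, so no overflow
        return 1.0 / (1.0 + math.exp(-x))
    # For x < 0, rewrite as e^x / (1 + e^x) so exp() never overflows
    z = math.exp(x)
    return z / (1.0 + z)

print(sigmoid(0.0))   # 0.5: the midpoint of the (0, 1) range
print(sigmoid(10.0))  # close to 1 for large positive inputs
print(sigmoid(-10.0)) # close to 0 for large negative inputs
```

Note the symmetry sigmoid(-x) = 1 - sigmoid(x), which is why the function is centered at 0.5 rather than 0.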
However, sigmoid has significant drawbacks for hidden layers in deep networks. It saturates for large positive or negative inputs (sigma'(x) approaches 0), causing the vanishing gradient problem: the derivative peaks at only 0.25 (at x = 0), so gradients shrink with every sigmoid layer during backpropagation, making early layers nearly impossible to train. Additionally, sigmoid outputs are not zero-centered (always positive), which can cause inefficient zigzag gradient updates.
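The saturation argument can be made concrete. The derivative has the closed form sigma'(x) = sigma(x) * (1 - sigma(x)); the sketch below (a self-contained illustration, not production code) shows how small it gets away from zero, and how repeated multiplication across layers collapses the gradient:

```python
import math

def sigmoid(x: float) -> float:
    # Stable sigmoid, branching on sign to avoid exp() overflow
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def sigmoid_grad(x: float) -> float:
    """Derivative: sigma'(x) = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_grad(0.0))   # 0.25, the maximum possible value
print(sigmoid_grad(10.0))  # ~4.5e-5: saturated, gradient nearly dead

# Even in the best case (every input exactly 0), chaining the 0.25
# factor through 20 layers multiplies the gradient by:
print(0.25 ** 20)  # ~9.1e-13, illustrating the vanishing gradient
```

Because even the maximum per-layer factor is 0.25, depth alone is enough to drive gradients toward zero; saturation only makes it worse.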
For these reasons, ReLU and its variants have largely replaced sigmoid in hidden layers of modern deep networks. Sigmoid remains important in specific contexts: binary classification output layers, the gates of LSTMs and GRUs, and any situation that calls for a smooth function mapping to (0, 1). Understanding why sigmoid fails in deep hidden layers is a key insight in the history of deep learning.
Last updated: February 22, 2026