Feed-Forward Network
Deep Learning

A simple neural network layer within each transformer block that independently transforms each token's representation through two linear transformations with a non-linear activation in between.
In transformer architectures, the feed-forward network (FFN) is applied independently to each token position after the attention sub-layer. It typically consists of two linear transformations with a non-linear activation: FFN(x) = W_2 * activation(W_1 * x + b_1) + b_2. The inner dimension is usually 4x the model dimension (e.g., 2048 for d_model=512).
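The formula above can be sketched directly. This is a minimal NumPy illustration (not any particular library's implementation); the ReLU activation, weight shapes, and random initialization are assumptions chosen to match the dimensions quoted in the text (d_model=512, inner dimension 2048).

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Position-wise FFN: FFN(x) = W2 @ activation(W1 @ x + b1) + b2.

    x has shape (seq_len, d_model); each row (token) is transformed
    independently with the same weights.
    """
    h = np.maximum(0, x @ W1 + b1)  # expand to the inner dimension (ReLU)
    return h @ W2 + b2              # project back down to d_model

# Illustrative sizes from the text: d_model=512, inner dimension 4x = 2048.
rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 512, 2048, 3
x = rng.standard_normal((seq_len, d_model))
W1 = rng.standard_normal((d_model, d_ff)) * 0.02
b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, d_model)) * 0.02
b2 = np.zeros(d_model)

y = ffn(x, W1, b1, W2, b2)
print(y.shape)  # (3, 512): output shape matches input shape
```

Because the weights are shared across positions and each row is transformed separately, running the FFN on a single token gives the same result as slicing that token out of the full-sequence output.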
The FFN serves a complementary role to attention. While attention mixes information across token positions (inter-token processing), the FFN processes each token's representation independently (intra-token processing). Research suggests that FFN layers act as key-value memories, storing factual knowledge learned during training. The expansion to a wider inner dimension gives the layer extra capacity, and the projection back down to the model dimension forces the network to compress that computation into an efficient representation.
Modern variants replace the standard two-layer FFN with gated architectures like SwiGLU (used in LLaMA, PaLM) which use a gating mechanism for better gradient flow. Each transformer block alternates between attention (communication between positions) and FFN (computation within each position), with residual connections and layer normalization around both sub-layers.
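A gated variant such as SwiGLU can be sketched the same way. This is an illustrative, bias-free NumPy version in the style reported for LLaMA (SiLU-gated product of two up-projections, then a down-projection); the weight names, initialization, and the reduced inner dimension are assumptions for the example, not an exact reproduction of any model's code.

```python
import numpy as np

def silu(z):
    # SiLU (Swish) activation: z * sigmoid(z)
    return z / (1.0 + np.exp(-z))

def swiglu_ffn(x, W_gate, W_up, W_down):
    """Gated FFN: one branch (SiLU(x @ W_gate)) gates the other
    (x @ W_up) elementwise, then the product is projected back down."""
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

rng = np.random.default_rng(1)
# Gated variants often use a smaller inner dimension (roughly 8/3 * d_model)
# to keep the parameter count comparable to the standard two-matrix FFN.
d_model, d_ff = 512, 1365
x = rng.standard_normal((2, d_model))
W_gate = rng.standard_normal((d_model, d_ff)) * 0.02
W_up = rng.standard_normal((d_model, d_ff)) * 0.02
W_down = rng.standard_normal((d_ff, d_model)) * 0.02

out = swiglu_ffn(x, W_gate, W_up, W_down)
print(out.shape)  # (2, 512)
```

The three weight matrices replace the two of the standard FFN, which is why the inner dimension is typically shrunk when swapping in a gated variant.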
Last updated: February 22, 2026