>_TheQuery

Mixture of Experts Routing

Architecture

The mechanism within a Mixture of Experts model that determines which expert sub-networks process each input token, directly controlling how model capacity is utilized.

Mixture of Experts routing is the process by which a gating mechanism assigns input tokens to specific expert sub-networks within a Mixture of Experts architecture. The router is typically a small neural network, often a single linear layer followed by a softmax, that produces a score for each expert given an input token. The top-k experts by score are selected to process that token, and their outputs are combined using the router scores as weights.
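The gating computation described above can be sketched in a few lines. This is a minimal single-token illustration, not any particular model's implementation; the names `route` and `W_gate` are illustrative.

```python
import numpy as np

def route(x, W_gate, k=2):
    """Top-k routing for one token: score every expert, keep the best k,
    and renormalize their softmax probabilities so the kept weights sum to 1.

    x: (d,) token hidden state; W_gate: (d, n_experts) router weights.
    Returns (expert_ids, weights) used to combine the chosen experts' outputs.
    """
    logits = x @ W_gate                       # one score per expert
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                      # softmax over experts
    top_k = np.argsort(probs)[-k:][::-1]      # indices of the k highest-scoring experts
    weights = probs[top_k] / probs[top_k].sum()
    return top_k, weights

rng = np.random.default_rng(0)
expert_ids, weights = route(rng.normal(size=16), rng.normal(size=(16, 8)), k=2)
print(expert_ids, weights)  # two expert indices and their mixing weights
```

In a full model this runs per token, per Mixture of Experts layer, and only the selected experts' feed-forward computations are executed, which is what keeps the cost sparse.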

Routing strategies vary in sophistication. Token-choice routing lets each token pick its preferred experts, which is simple but can cause load imbalance. Expert-choice routing inverts the process, letting each expert select which tokens to process from a batch, which naturally balances load but can leave some tokens underserved. Hash-based routing assigns tokens to experts deterministically based on token properties, eliminating the learned router entirely but sacrificing adaptive specialization.
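Of the three strategies, hash-based routing is the simplest to make concrete, since it needs no learned parameters at all. A minimal sketch, with a hypothetical `hash_route` helper:

```python
import hashlib

def hash_route(token_id: int, num_experts: int) -> int:
    """Deterministic hash routing: a token id always maps to the same
    expert, so no router has to be trained and load is roughly uniform
    over a large vocabulary -- at the cost of adaptive specialization."""
    digest = hashlib.sha256(str(token_id).encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_experts

# The same token always reaches the same expert, across batches and runs:
print(hash_route(42, 8) == hash_route(42, 8))  # True
```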

The quality of routing determines whether a Mixture of Experts model realizes its theoretical capacity advantage. Poor routing leads to expert collapse, where a few experts handle most tokens while others sit idle, or to misrouting, where tokens reach experts that are not specialized for them. Architectures such as the Switch Transformer and DeepSeek-V3 use auxiliary losses, capacity constraints, and learned affinity scores to keep routing both balanced and specialized. The routing decision is made independently at every Mixture of Experts layer, so a single token may visit different experts at different depths, enabling fine-grained specialization throughout the model.
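One common countermeasure to expert collapse is the Switch-Transformer-style auxiliary load-balancing loss, N · Σᵢ fᵢ · Pᵢ, where fᵢ is the fraction of tokens dispatched to expert i and Pᵢ is the mean router probability assigned to it. The sketch below assumes top-1 routing; the function name is illustrative.

```python
import numpy as np

def load_balancing_loss(router_probs, expert_assignments, num_experts):
    """Auxiliary loss N * sum_i(f_i * P_i). It reaches its minimum of 1.0
    when both the dispatch fractions f and the mean probabilities P are
    uniform (1/N per expert), so minimizing it pushes routing toward balance.

    router_probs: (tokens, num_experts) softmax outputs of the router.
    expert_assignments: (tokens,) top-1 expert chosen for each token.
    """
    f = np.bincount(expert_assignments, minlength=num_experts) / len(expert_assignments)
    P = router_probs.mean(axis=0)
    return num_experts * float(f @ P)

# Perfectly balanced routing over 4 experts yields the minimum value:
probs = np.full((4, 4), 0.25)
print(load_balancing_loss(probs, np.array([0, 1, 2, 3]), 4))  # 1.0
```

During training this term is added to the language-modeling loss with a small coefficient, nudging the router toward uniform dispatch without dictating which expert handles which token.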

Last updated: March 5, 2026