Load Balancing
Architecture

In Mixture of Experts models, the set of techniques that ensure tokens are distributed evenly across experts during training and inference to prevent expert collapse and maximize model capacity.
Load balancing in the context of Mixture of Experts models refers to mechanisms that ensure all experts receive a roughly equal share of input tokens. Without explicit balancing, routers tend to converge on sending most tokens to a small number of experts, a failure mode called expert collapse. When this happens, the model effectively wastes most of its parameter capacity because unused experts never learn meaningful representations.
The most common approach adds an auxiliary loss to the training objective that penalizes imbalanced expert utilization. This loss measures how far the distribution of tokens across experts deviates from uniform and pushes the router toward a more even split. The weight of this auxiliary loss must be carefully tuned: too low, and collapse still occurs; too high, and the router is forced into artificially uniform routing that ignores genuine specialization.
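One widely used form of this auxiliary loss (popularized by the Switch Transformer) scales the dot product between the fraction of tokens routed to each expert and the mean router probability per expert; it is minimized at 1.0 under perfectly uniform routing. The sketch below is an illustrative implementation under that assumption, not code from the source; the function name and shapes are hypothetical.

```python
import numpy as np

def load_balancing_loss(router_logits: np.ndarray, num_experts: int) -> float:
    """Switch-style auxiliary loss: num_experts * dot(f, p).

    router_logits: (num_tokens, num_experts) raw router scores.
    Returns ~1.0 for perfectly balanced routing, larger when skewed.
    """
    # Softmax over the expert dimension to get routing probabilities.
    z = router_logits - router_logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    # f_i: fraction of tokens whose top-1 choice is expert i.
    top1 = probs.argmax(axis=-1)
    f = np.bincount(top1, minlength=num_experts) / len(top1)
    # p_i: mean router probability mass assigned to expert i.
    p = probs.mean(axis=0)
    # Both f and p are uniform (1/num_experts) at the optimum, so the
    # scaled dot product bottoms out at 1.0.
    return float(num_experts * np.dot(f, p))
```

Comparing a balanced router against a collapsed one shows the loss penalizing the collapse: near-one-hot logits that cycle tokens across experts yield a loss close to 1.0, while logits that send every token to expert 0 yield a loss close to the number of experts.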
Other techniques include capacity factors, which cap the number of tokens each expert may process and drop any overflow, and expert parallelism strategies that distribute experts across hardware so that load balancing also serves as a compute distribution mechanism. In production serving, load balancing extends to infrastructure concerns: ensuring that expert activations are distributed across GPUs or TPUs efficiently to minimize communication overhead and maximize throughput.
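The capacity-factor mechanism can be sketched as follows. Each expert's capacity is typically `capacity_factor * num_tokens / num_experts`; tokens routed to a full expert are dropped (commonly passed through via the residual connection instead). The function name and the `-1` drop marker below are illustrative assumptions, not from the source.

```python
import numpy as np

def route_with_capacity(top1: np.ndarray, num_experts: int,
                        capacity_factor: float = 1.25) -> np.ndarray:
    """Assign tokens to their top-1 expert, dropping overflow.

    top1: (num_tokens,) array of each token's chosen expert index.
    Returns an array where entry t is the expert that processes token t,
    or -1 if the token was dropped because its expert was at capacity.
    """
    num_tokens = len(top1)
    # Hard per-expert limit derived from the capacity factor.
    capacity = int(capacity_factor * num_tokens / num_experts)
    assigned = np.full(num_tokens, -1, dtype=int)
    counts = np.zeros(num_experts, dtype=int)
    for t, e in enumerate(top1):
        if counts[e] < capacity:
            assigned[t] = e
            counts[e] += 1
    return assigned
```

With a capacity factor of 1.0 and an imbalanced routing (four of eight tokens choosing expert 0), the first two tokens fit within expert 0's capacity of two and the rest are dropped, which is exactly the pressure that motivates pairing capacity limits with the auxiliary balancing loss.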
Last updated: March 5, 2026