Mixture of Experts (MoE)

A machine learning architecture that divides a model into multiple specialized sub-networks (experts) and activates only a subset per input, enabling larger model capacity with lower computational cost.

Mixture of Experts (MoE) is a neural network design originally proposed by Jacobs et al. in 1991 and later adapted for modern deep learning, notably in the sparse form introduced by Shazeer et al. in 2017. The core idea is to replace a single large feedforward layer with a collection of smaller, specialized sub-networks called experts. A gating mechanism, or router, learns to assign each input token to a small subset of experts, typically one or two, while the remaining experts are left inactive. This selective activation keeps the computational cost per token roughly constant even as the total number of experts grows, allowing the model to scale to billions of parameters without a proportional increase in inference or training compute.
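To make the "roughly constant cost per token" claim concrete, the following is a minimal back-of-the-envelope sketch. It assumes each expert is a two-matrix feedforward block and the router is a single linear layer. The function names and layer sizes (d_model=1024, d_ff=4096, top_k=2) are hypothetical values chosen for illustration, not figures from any particular model.

```python
def ffn_params(d_model, d_ff):
    """Parameters in one feedforward expert: up- and down-projection, biases ignored."""
    return 2 * d_model * d_ff


def moe_layer_params(d_model, d_ff, num_experts, top_k):
    """Return (total, active_per_token) parameter counts for one MoE layer.

    Assumes a linear router with one weight row per expert (an illustrative choice).
    """
    per_expert = ffn_params(d_model, d_ff)
    router = d_model * num_experts
    total = num_experts * per_expert + router
    active = top_k * per_expert + router
    return total, active


# Hypothetical sizes: total parameters grow with the expert count,
# while the parameters touched per token grow only by the router rows.
for num_experts in (8, 64, 512):
    total, active = moe_layer_params(1024, 4096, num_experts, top_k=2)
    print(f"{num_experts:>4} experts: total={total:,}  active per token={active:,}")
```

Under these assumptions, only the small router term grows in the per-token count as experts are added, while the total parameter count grows almost linearly with the number of experts.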

In practice, MoE layers in large language models usually take the place of the feedforward sublayer in some or all transformer blocks, while the attention sublayers remain dense. The gating network outputs a probability distribution over experts, and only the top-k experts with the highest probabilities are activated. The outputs of the selected experts are then combined as a weighted sum using the gating weights. To balance load across experts, auxiliary losses are added during training to penalize situations where a few experts receive most of the inputs. This helps prevent expert collapse, where the router sends most tokens to a few experts and the remaining experts are underutilized and fail to learn useful representations.
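The routing and combination steps described above can be sketched for a single token as follows. This is a dependency-free illustration rather than a production implementation: the linear router, the toy "scaling" experts, and the `renormalize` option are assumptions introduced here, and real systems batch these operations with a tensor library.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(token, router_weights, experts, k=2, renormalize=False):
    """Route one token vector through its top-k experts.

    router_weights: one weight row per expert (hypothetical linear router).
    experts: list of callables mapping a vector to a vector of the same size.
    Returns the combined output and the (expert_index, gate) pairs that were used.
    """
    # Router logits: dot product of the token with each expert's router row.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    # Keep only the k experts with the highest gating probability.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Optionally rescale the kept gates so they sum to 1 (a common variant).
    scale = sum(probs[i] for i in top) if renormalize else 1.0
    selected = [(i, probs[i] / scale) for i in top]
    # Weighted sum of the selected experts' outputs; other experts never run.
    output = [0.0] * len(token)
    for i, gate in selected:
        expert_out = experts[i](token)
        output = [acc + gate * y for acc, y in zip(output, expert_out)]
    return output, selected


def make_scaling_expert(factor):
    """Toy expert that multiplies its input by a constant (illustration only)."""
    return lambda vec: [factor * v for v in vec]


experts = [make_scaling_expert(f) for f in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[2.0, 0.0], [1.0, 0.0], [0.0, 0.0], [-1.0, 0.0]]
output, selected = moe_forward([1.0, 0.0], router_weights, experts, k=2)
print("selected experts and gates:", selected)
print("combined output:", output)
```

With these toy router weights, the first two experts receive the highest scores, so only they are evaluated. Passing `renormalize=True` rescales the two gates to sum to 1 before combining.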

MoE has been successfully applied in models such as the Mixture of Experts Transformer, GLaM, and Switch Transformer. It enables training models with more than a trillion parameters, such as Google's 1.6-trillion-parameter Switch Transformer, while keeping per-token inference compute comparable to that of much smaller dense models. However, MoE introduces challenges in distributed training, memory management, and expert load balancing, and it requires careful engineering to scale efficiently.

Why it matters

MoE is a key technique for scaling neural networks to massive sizes without a proportional increase in computational cost, making it feasible to train and deploy models with hundreds of billions or trillions of parameters. It enables state-of-the-art performance in language modeling and other domains while keeping inference efficient, which is critical for real-world applications where latency and compute budgets are constrained. MoE also influences the design of large-scale distributed training systems, as experts can be placed on different devices to parallelize computation.

First appeared

Jacobs et al., 1991; modern sparse MoE: Shazeer et al., 2017.

FAQ

How does it work?

A gating network (router) processes each input token and outputs a probability distribution over a set of expert sub-networks. Only the top-k experts with the highest probabilities are activated. The outputs of these selected experts are weighted by their gating probabilities and summed to produce the final output. This selective activation keeps computational cost per token low, even as the total number of experts grows.
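The same computation can be written compactly. The notation below is an assumption introduced for illustration: x is the token representation, W_r a linear router, E_i the i-th of N experts, and k the number of experts kept per token.

```latex
p = \operatorname{softmax}(W_r x) \in \mathbb{R}^{N}, \qquad
\mathcal{T} = \operatorname{TopK}(p, k), \qquad
y = \sum_{i \in \mathcal{T}} p_i \, E_i(x)
```

Some implementations additionally renormalize the selected weights so that they sum to 1 over the set T.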

What are the main challenges of using MoE?

Key challenges include load balancing, where some experts may receive far more inputs than others, requiring auxiliary loss functions to encourage uniform usage. Training and inference also demand careful memory management, as all expert parameters must be stored even though only a subset is used per token. Distributed training is more complex because experts must be efficiently sharded across devices while minimizing communication overhead.
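As an illustration of the auxiliary loss mentioned above, here is a minimal sketch of one common formulation, modeled on the load-balancing loss described for Switch Transformer: the number of experts times the sum over experts of (fraction of tokens routed to the expert) times (mean router probability for the expert). The function name and the top-1 routing assumption are choices made for this example. In training, the term is typically scaled by a small coefficient and added to the main loss.

```python
def load_balancing_loss(router_probs, assignments, num_experts):
    """Auxiliary load-balancing term for top-1 routing (illustrative sketch).

    router_probs: per-token lists of gating probabilities, one entry per expert.
    assignments: the expert index each token was dispatched to.
    The value is 1.0 under perfectly uniform routing and grows as routing concentrates.
    """
    num_tokens = len(router_probs)
    # f_i: fraction of tokens dispatched to expert i.
    frac_tokens = [0.0] * num_experts
    for expert in assignments:
        frac_tokens[expert] += 1.0 / num_tokens
    # P_i: mean gating probability the router assigns to expert i.
    mean_prob = [
        sum(probs[i] for probs in router_probs) / num_tokens
        for i in range(num_experts)
    ]
    return num_experts * sum(f * p for f, p in zip(frac_tokens, mean_prob))


# Balanced routing: four tokens spread evenly over four experts.
balanced = load_balancing_loss([[0.25] * 4] * 4, [0, 1, 2, 3], num_experts=4)
# Collapsed routing: every token sent to expert 0 with full confidence.
collapsed = load_balancing_loss([[1.0, 0.0, 0.0, 0.0]] * 4, [0, 0, 0, 0], num_experts=4)
print("balanced:", balanced, "collapsed:", collapsed)
```

The collapsed case scores higher than the balanced one, so minimizing this term pushes the router toward more uniform expert usage.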

When should MoE be used instead of a dense model?

MoE is most beneficial when the desired model capacity exceeds what a dense model can deliver within a given compute budget. It does not relieve memory constraints: all expert parameters must still be stored, so MoE trades additional memory for lower compute per token. It is well suited to large-scale language models, recommendation systems, and other tasks where a very large number of parameters is needed but inference must remain efficient. For smaller models or tasks where compute is not a bottleneck, dense models may be simpler and equally effective.