Glossary Entry

Mixture of Experts

A neural architecture that routes each input to a small subset of specialist subnetworks, so a model can hold huge total capacity while spending only a fraction of it per token.

Architecture Models

Also called: MoE, mixture-of-experts, sparse mixture of experts

Seed source: Shazeer et al. 2017

A mixture-of-experts layer replaces one big feed-forward block with many parallel “expert” blocks and a learned router that sends each token to only a few of them. The result is a model whose total parameter count can be enormous while the compute per token stays modest, because most experts sit idle for any given input.
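The routing step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the formulation used by any particular model: the names `moe_forward`, `matvec`, and `random_matrix`, the two-layer ReLU experts, the choice of `top_k=2`, and the softmax taken over only the chosen experts are all assumptions made for the example.

```python
import math
import random

def matvec(vec, mat):
    """Multiply a row vector (length n) by a matrix stored as n rows of m columns."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]

def moe_forward(token, router, experts, top_k=2):
    """Route one token vector through its top_k experts (illustrative sketch).

    router:  d_model x n_experts weight matrix (hypothetical layout)
    experts: list of (w_in, w_out) pairs, each a small two-layer ReLU block
    Returns the gate-weighted output and the indices of the experts that ran.
    """
    logits = matvec(token, router)
    # Keep only the top_k highest-scoring experts; the rest stay idle.
    chosen = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:top_k]
    # Softmax over the chosen experts only, so their gate weights sum to 1.
    peak = max(logits[e] for e in chosen)
    weights = [math.exp(logits[e] - peak) for e in chosen]
    total = sum(weights)
    output = [0.0] * len(token)
    for e, w in zip(chosen, weights):
        w_in, w_out = experts[e]
        hidden = [max(0.0, h) for h in matvec(token, w_in)]  # ReLU
        expert_out = matvec(hidden, w_out)
        output = [o + (w / total) * y for o, y in zip(output, expert_out)]
    return output, chosen

def random_matrix(rng, rows, cols):
    """Hypothetical helper: a rows x cols matrix of uniform random weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

# Toy sizes chosen for readability, not realism.
rng = random.Random(0)
d_model, d_hidden, n_experts = 4, 8, 8
router = random_matrix(rng, d_model, n_experts)
experts = [
    (random_matrix(rng, d_model, d_hidden), random_matrix(rng, d_hidden, d_model))
    for _ in range(n_experts)
]
token = [rng.uniform(-1.0, 1.0) for _ in range(d_model)]

output, chosen = moe_forward(token, router, experts, top_k=2)
print("experts used:", chosen, "of", n_experts)
```

With these toy sizes, only 2 of the 8 expert blocks do any arithmetic for the token, while all 8 still have to be stored. Production systems typically batch tokens per expert and add load-balancing terms during training, which this sketch leaves out.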

Modern frontier-scale open models lean heavily on this trick: DeepSeek-V3, Qwen3’s larger variants, and the GLM-5 family are all specified by two numbers, total parameters and active parameters per token (for example, 744B total with 40B active). The gap between those numbers is the MoE advantage: capacity that is held in memory but not exercised on every forward pass.
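To make the total-versus-active gap concrete, the arithmetic for the example figures above can be worked out directly. The helper name `active_fraction` is hypothetical, and treating the active-parameter share as a stand-in for per-token compute is a simplifying assumption.

```python
def active_fraction(total_params_b, active_params_b):
    """Share of stored parameters that take part in one token's forward pass."""
    return active_params_b / total_params_b

# Example figures from the text, in billions of parameters.
total_b, active_b = 744, 40
frac = active_fraction(total_b, active_b)

print(f"{frac:.1%} of weights active per token")         # 5.4% of weights active per token
print(f"{total_b / active_b:.1f}x stored-to-active ratio")  # 18.6x stored-to-active ratio
```

Under that assumption, each token touches roughly one-nineteenth of the stored weights, though all of them still need to be held in memory or reachable from it.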