A mixture-of-experts layer replaces one big feed-forward block with many parallel “expert” blocks and a learned router that sends each token to only a few of them. The result is a model whose total parameter count can be enormous while the compute per token stays modest, because most experts sit idle for any given token.
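
To make the routing step concrete, here is a minimal sketch of top-k routing for a single token, in plain Python. Everything in it is illustrative: the names `route_top_k`, `moe_layer`, and `make_expert` are hypothetical, the experts are toy scaling functions rather than feed-forward networks, and renormalizing the top-k weights so they sum to 1 is one common convention, not something every model does.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Assumption: the selected weights are renormalized to sum to 1.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def make_expert(scale):
    """Toy stand-in for an expert feed-forward block: scales its input."""
    return lambda x: [scale * v for v in x]


def moe_layer(token, router_weights, experts, k=2):
    """Run one token through only its top-k experts and mix their outputs."""
    # Router: one logit per expert (dot product of the token with that
    # expert's row of router weights).
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    chosen = route_top_k(logits, k)

    out = [0.0] * len(token)
    for idx, weight in chosen:
        # Only the chosen experts are evaluated; the rest stay idle.
        expert_out = experts[idx](token)
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out, [idx for idx, _ in chosen]


random.seed(0)
dim, num_experts = 4, 8
experts = [make_expert(s + 1) for s in range(num_experts)]
router_weights = [[random.gauss(0, 1) for _ in range(dim)]
                  for _ in range(num_experts)]

token = [0.5, -1.0, 0.25, 2.0]
output, used = moe_layer(token, router_weights, experts, k=2)
print("experts used:", used, "of", num_experts)
```

With 8 experts and k=2, six of the eight expert functions are never called for this token, which is where the compute saving comes from. Real implementations batch tokens per expert and usually add a load-balancing mechanism, both of which this sketch leaves out.
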
Modern frontier-scale open models lean heavily on this trick: DeepSeek-V3, Qwen3’s larger variants, and the GLM-5 family all describe themselves by two numbers, total parameters and active parameters per token (for example, 744B total with 40B active). The gap between those numbers is the MoE advantage: capacity you hold in memory but do not pay compute for on every forward pass.
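
As a quick worked example, the arithmetic behind the 744B total and 40B active figures can be sketched as follows. The helper `active_fraction` is a hypothetical name, and two inputs are assumptions rather than facts from any model card: 2 bytes per parameter (16-bit weights) for the storage estimate, and the rough rule of thumb that a forward pass costs about 2 FLOPs per active parameter per token, which ignores attention over the context.

```python
def active_fraction(total_params, active_params):
    """Share of the model's parameters that one token actually uses."""
    return active_params / total_params


total = 744e9   # total parameters (figure from the text)
active = 40e9   # active parameters per token (figure from the text)

frac = active_fraction(total, active)
idle = total - active

# Assumption: 16-bit weights, so 2 bytes per parameter.
bytes_per_param = 2
storage_gb = total * bytes_per_param / 1e9

# Assumption: about 2 FLOPs per active parameter per token (rough estimate).
flops_per_token = 2 * active
dense_flops_per_token = 2 * total  # same rule applied to a dense model of equal size

print(f"active fraction: {frac:.1%}")                # 5.4%
print(f"idle per token:  {idle / 1e9:.0f}B params")  # 704B
print(f"weights to store: {storage_gb:.0f} GB")      # 1488 GB
print(f"compute vs dense: {flops_per_token / dense_flops_per_token:.1%}")
```

Under these assumptions, every token still needs all 744B parameters resident somewhere, but pays for only about one-twentieth of them in compute.
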
