Mixture-of-experts models: why active parameters matter

A model labelled “7B” might not use all 7 billion parameters for every token. If it uses a mixture-of-experts architecture, it activates only a fraction of its total parameters per forward pass — and that distinction changes how you should think about cost, speed, and capability.

TL;DR

Mixture-of-experts (MoE) models split the feed-forward network into several “experts” and route each input token to only a subset of them. A model with 47 billion total parameters might activate only 7 billion per token, making it roughly as fast per token as a dense 7B model while drawing on the capacity of all 47B. It is not as memory-efficient as a dense 7B model, though: all 47B parameters must still be loaded.

The key number for cost and inference speed is active parameters, not total parameters. The key number for memory is total parameters.

What MoE actually does

In a standard dense transformer, every token passes through all parameters of every layer. In an MoE transformer, the feed-forward block in some or all layers is replaced by several expert sub-networks, and a learned router decides which experts to activate for each token. The attention layers are typically shared by every token.

Typically, each token activates 1–2 experts per MoE layer out of 8, 16, or even 64 total experts, though some designs route to more. Because only the selected experts run, a 47B-parameter model with 7B active parameters uses roughly the same compute per token as a dense 7B model.

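The router can learn to specialise experts: one might handle code tokens, another legal language, another mathematical notation. But any specialisation emerges from training; the router is not manually programmed. In practice the patterns are often less tidy than topic labels suggest: the Mixtral paper, for example, reports no obvious topic-based pattern in expert assignment.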
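The routing step described above can be sketched for a single token. This is a minimal illustration under stated assumptions: the experts are toy scalar functions rather than feed-forward networks, the router logits are random stand-ins for the output of a learned layer, and the names `route_token` and `moe_layer` are hypothetical, not taken from any library. Renormalising the weights over the selected experts is one common variant; other designs weight the expert outputs differently.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, top_k=2):
    """Pick the top_k experts for one token and renormalise their weights.

    Returns a list of (expert_index, weight) pairs whose weights sum to 1.
    """
    probs = softmax(router_logits)
    # Indices of the top_k highest-probability experts.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    kept = sum(probs[i] for i in chosen)
    return [(i, probs[i] / kept) for i in chosen]


def moe_layer(token, router_logits, experts, top_k=2):
    """Weighted sum of the outputs of the selected experts only."""
    output = 0.0
    for index, weight in route_token(router_logits, top_k):
        output += weight * experts[index](token)  # unselected experts never run
    return output


# Toy setup (assumption): 8 "experts", each a scalar function.
experts = [lambda x, scale=s: scale * x for s in range(1, 9)]
rng = random.Random(0)
# Stand-in for a learned router; in a real model the logits depend on the token.
logits = [rng.gauss(0.0, 1.0) for _ in experts]

selection = route_token(logits, top_k=2)
result = moe_layer(3.0, logits, experts, top_k=2)
```

Only two of the eight toy experts are evaluated for the token, which is the source of the compute saving: the other six contribute parameters to the model but no work to this forward pass.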

What the numbers actually tell you

The total parameter count determines the model’s memory footprint and knowledge capacity. A 47B MoE model needs enough GPU memory to hold all 47B parameters (roughly 94 GB at FP16, before the KV cache and activations), even though it only uses ~7B per token.
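The 94 GB figure is simple arithmetic: parameters multiplied by bytes per parameter. A small sketch, assuming decimal gigabytes (10^9 bytes) and counting weights only; the helper name `weight_memory_gb` is illustrative.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(total_params, precision="fp16"):
    """Memory for the weights alone, in decimal GB (excludes KV cache and activations)."""
    return total_params * BYTES_PER_PARAM[precision] / 1e9


moe_weights_gb = weight_memory_gb(47_000_000_000)    # 94.0
dense_weights_gb = weight_memory_gb(7_000_000_000)   # 14.0
```

The MoE model in this example needs several times the weight memory of the dense 7B model it matches on per-token compute.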

The active parameter count determines inference speed and per-token cost. A 7B-active MoE model is comparable to a dense 7B model in latency and throughput, assuming similar implementation efficiency.
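A rough way to see why active parameters drive per-token compute: a widely used rule of thumb (an assumption here, not a figure from the sources above) puts the forward pass at about 2 floating-point operations per active parameter per token, ignoring attention over the context.

```python
def forward_flops_per_token(active_params):
    """Rule-of-thumb forward-pass cost: ~2 FLOPs (a multiply and an add) per active parameter."""
    return 2 * active_params


moe_flops = forward_flops_per_token(7_000_000_000)         # 47B total, 7B active
dense_7b_flops = forward_flops_per_token(7_000_000_000)    # dense: all 7B active
dense_47b_flops = forward_flops_per_token(47_000_000_000)  # dense: all 47B active

# How much more arithmetic a dense 47B model does per token than the MoE.
ratio = dense_47b_flops / moe_flops
```

This counts arithmetic only. Real latency also depends on memory bandwidth, batching, and routing overhead, which is why the comparison holds only under similar implementation efficiency.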

Expert count affects both capacity and routing: more experts increase total capacity, but can degrade routing quality if the routing becomes so sparse that each expert sees too few tokens.
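The relationship between expert count, total parameters, and active parameters can be written down directly. In this sketch the split between shared and per-expert parameters is hypothetical, chosen only so the totals land near the 47B/7B example used in this article; `MoEConfig` is an illustrative name, not a real library class.

```python
from dataclasses import dataclass


@dataclass
class MoEConfig:
    shared_params: int      # attention, embeddings, router: used by every token
    params_per_expert: int  # one expert's parameters, summed across all MoE layers
    num_experts: int
    top_k: int              # experts activated per token

    @property
    def total_params(self):
        return self.shared_params + self.num_experts * self.params_per_expert

    @property
    def active_params(self):
        return self.shared_params + self.top_k * self.params_per_expert


# Hypothetical split: 1.3B shared, 5.7B per expert, 8 experts, 1 active.
base = MoEConfig(1_300_000_000, 5_700_000_000, num_experts=8, top_k=1)

# Doubling the expert count grows total parameters but leaves active parameters unchanged.
wider = MoEConfig(1_300_000_000, 5_700_000_000, num_experts=16, top_k=1)
```

Here `base` has 46.9B total and 7.0B active parameters, while `wider` has 92.5B total and the same 7.0B active: memory cost scales with expert count, per-token compute does not.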

Where teams get confused

Comparing MoE total parameters to dense model sizes. “Our 47B model beats GPT-3.5” conflates total and active parameters. The meaningful comparison for speed and cost is active parameters.
Assuming MoE always means faster or cheaper. MoE models have higher memory requirements (all experts must be loaded) and can have higher per-token latency than a dense model with the same active parameter count if the router or expert switching overhead is high.
Treating all MoE implementations as equivalent. Router architecture, expert granularity, load balancing, and capacity factor all affect real-world performance. Mixtral 8x7B, DeepSeek MoE, and Qwen MoE all make different design choices.
Assuming expert specialisation makes MoE models modular. You cannot selectively load or unload individual experts to save memory without retraining. The router expects all experts to be present during inference.
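To make the capacity factor from the list above concrete: in the Switch Transformers formulation, each expert can accept at most (tokens per batch / number of experts) × capacity factor tokens, and tokens routed beyond that limit are dropped from the expert computation. The sketch below assumes that formulation and rounds the capacity up; the function names and the token assignments are made up for illustration, and not every implementation enforces a capacity limit.

```python
import math
from collections import Counter


def expert_capacity(tokens_per_batch, num_experts, capacity_factor):
    """Maximum number of tokens one expert may process in a batch."""
    return math.ceil(tokens_per_batch / num_experts * capacity_factor)


def dropped_tokens(assignments, num_experts, capacity_factor):
    """Count tokens that overflow their expert's capacity.

    assignments: list of expert indices, one per token (top-1 routing).
    """
    capacity = expert_capacity(len(assignments), num_experts, capacity_factor)
    load = Counter(assignments)
    return sum(max(0, load[e] - capacity) for e in range(num_experts))


# Unbalanced toy batch: 8 tokens, 4 experts, expert 0 receives 5 of them.
assignments = [0, 0, 0, 0, 0, 1, 2, 3]

dropped_tight = dropped_tokens(assignments, num_experts=4, capacity_factor=1.0)  # capacity 2
dropped_loose = dropped_tokens(assignments, num_experts=4, capacity_factor=1.5)  # capacity 3
```

With a capacity factor of 1.0 the overloaded expert drops three tokens; at 1.5 it drops two. A higher factor wastes fewer tokens but reserves more compute and memory per expert, which is one reason load balancing matters in practice.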

Practical decision check

Active parameters, not total, drive per-token inference cost.
Total parameters determine GPU memory requirements — MoE models are memory-hungry even if they are fast.
MoE is most useful when you need broad knowledge capacity within a strict inference latency budget.
MoE models are more complex to serve — ensure your runtime supports them (vLLM does; llama.cpp has partial support; Ollama is workable for small MoE models).
For small-scale or latency-sensitive applications, a dense model with matching active parameters may be simpler and cheaper.

Methodology

Data checked: 2026-05-28
Sources consulted: MoE architecture papers (Shazeer et al. 2017, Fedus et al. 2022), model cards for Mixtral 8x7B, DeepSeek-V2, Qwen2-MoE, community inference benchmarks, and vLLM MoE serving documentation
Assumptions: MoE efficiency depends heavily on implementation. The ratio of active to total parameters varies between model families. Router quality and load balancing are active research areas. Inference runtime support for MoE varies and is evolving.
Limitations: This article explains MoE architecture conceptually; it does not benchmark specific MoE models against dense alternatives. It does not cover training considerations for MoE models, distillation from MoE to dense architectures, or hardware-specific MoE optimisations. Runtime support claims reflect mid-2026 status and may change.
Jurisdiction: Global. MoE architecture is a technical concept with no jurisdiction-specific implications.

Source list

Outrageously Large Neural Networks (Shazeer et al. 2017) — https://arxiv.org/abs/1701.06538 (accessed 2026-05-28)
Switch Transformers (Fedus et al. 2022) — https://arxiv.org/abs/2101.03961 (accessed 2026-05-28)
Mixtral of Experts — https://arxiv.org/abs/2401.04088 (accessed 2026-05-28)
DeepSeek-V2 MoE architecture — https://arxiv.org/abs/2405.04434 (accessed 2026-05-28)
vLLM MoE serving documentation — https://docs.vllm.ai/en/stable/ (accessed 2026-05-28)

Trust Stack

Last checked: 2026-05-28
Corrections: Contact us to report errors

Change log

2026-05-28: Full editorial review against 16-gate checklist. Added three Editor’s Note aside cards, slugified all heading IDs, added Trust Stack section with corrections policy and affiliation declaration, corrected frontmatter writtenBy label, fixed truncated description, standardised Methodology and Source List formats with access dates.
2026-05-27: Added direct source URLs and Change Log section.