theLLMs

Last checked: 2026-05-25

Scope: Global. MoE architecture is well-established; specific model implementations and router behaviours may vary.

AI draft model: deepseek-v4-flash

AI review model: llm-editor (deepseek-v4-pro)

Mixture-of-experts models: why active parameters matter

A model labelled “7B” might not use all 7 billion parameters for every token. If it uses a mixture-of-experts architecture, it activates only a fraction of its total parameters per forward pass — and that distinction changes how you should think about cost, speed, and capability.

Quick answer

Mixture-of-experts (MoE) models split the feed-forward network into several “experts” and route each input token to only a subset of them. A model with 47 billion total parameters might activate only 7 billion per token, giving it roughly the per-token compute of a dense 7B model while drawing on the capacity of all 47B. It still needs memory for all 47B.

The key number for cost and inference speed is active parameters, not total parameters. The key number for memory is total parameters.
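
The relationship between the two numbers can be sketched with simple arithmetic. The split below is hypothetical (1.0B shared parameters, 8 experts of 5.75B each summed over all layers, 1 expert per token) and was chosen only to land near the 47B total and 7B active figures used in this article; real models publish their own breakdowns.

```python
def moe_param_counts(shared_params, params_per_expert, num_experts, experts_per_token):
    """Return (total, active) parameter counts for a simplified MoE model.

    shared_params: parameters every token uses (attention, embeddings, router).
    params_per_expert: parameters of one expert, summed over all MoE layers.
    """
    total = shared_params + num_experts * params_per_expert
    active = shared_params + experts_per_token * params_per_expert
    return total, active


# Hypothetical split, not taken from any real model card.
total, active = moe_param_counts(
    shared_params=1.0e9,
    params_per_expert=5.75e9,
    num_experts=8,
    experts_per_token=1,
)

print(f"total:  {total / 1e9:.2f}B")   # what must fit in memory
print(f"active: {active / 1e9:.2f}B")  # what each token actually uses
```

Raising experts_per_token from 1 to 2 in this sketch leaves the total unchanged but nearly doubles the active count, which is why the two numbers have to be reported separately.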

What MoE actually does

In a standard dense transformer, every token passes through all parameters of every layer. In an MoE transformer, the feed-forward block in each MoE layer is replaced by several expert sub-networks, and a learned router decides which experts to activate for each token. Attention layers and embeddings are still shared by every token.

Many designs activate 1–2 experts per token in each MoE layer, out of 8, 16, or even 64 total experts; fine-grained designs use smaller experts and activate more of them. Either way, a 47B-parameter model with 7B active parameters uses roughly the same compute per token as a dense 7B model.

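Experts specialise during training; the router is learned, not manually programmed. The intuitive picture is one expert for code tokens, another for legal language, another for mathematical notation, but published analyses (for example, of Mixtral 8x7B) found routing that follows token-level and syntactic patterns more than topic.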
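
The routing step described above can be sketched in plain Python. This is a minimal top-k router under stated assumptions, not the implementation of any particular model: the router is a single linear layer, the toy experts simply scale their input, and all weights and names (route_token, moe_layer, router_weights) are made up for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(token, router_weights, k=2):
    """Score every expert for one token, keep the top k, renormalise their weights."""
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_layer(token, router_weights, experts, k=2):
    """Run only the selected experts and mix their outputs by router weight."""
    out = [0.0] * len(token)
    for idx, weight in route_token(token, router_weights, k):
        expert_out = experts[idx](token)
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out


# Toy setup (all values hypothetical): 4 experts, each just scales its input.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[0.1, 0.0], [2.0, 0.0], [0.3, 0.0], [1.0, 0.0]]  # one row per expert
token = [1.0, -1.0]

chosen = route_token(token, router_weights, k=2)
print([i for i, _ in chosen])  # indices of the experts that actually ran
print(moe_layer(token, router_weights, experts, k=2))
```

Only k of the four expert functions are ever called for a given token, which is where the compute saving comes from; the other experts still have to exist in memory because a different token may select them.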

What the numbers actually tell you

Total parameters determine the model’s memory footprint and knowledge capacity. A 47B MoE model needs enough GPU memory to hold all 47B parameters (roughly 94 GB at FP16 for the weights alone, at 2 bytes per parameter), even though it only uses ~7B per token.

Active parameters determine inference speed and per-token cost. A 7B-active MoE model is comparable to a dense 7B model in latency and throughput, assuming similar implementation efficiency.

Expert count affects both: more experts increase total capacity, but routing and load balancing get harder as each expert sees a smaller share of the tokens.
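
As a rough back-of-the-envelope check, the two quantities can be estimated separately. The sketch below assumes weights dominate memory (it ignores the KV cache and activations) and uses the common approximation of about 2 FLOPs per active parameter per token for a forward pass, which ignores attention over long contexts. The function names are illustrative.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp16": 2, "int8": 1, "int4": 0.5}


def weight_memory_gb(total_params, precision="fp16"):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return total_params * BYTES_PER_PARAM[precision] / 1e9


def forward_flops_per_token(active_params):
    """Rule-of-thumb forward-pass compute: ~2 FLOPs per active parameter."""
    return 2 * active_params


total_params = 47e9   # must all be resident in memory
active_params = 7e9   # used for any single token

for precision in ("fp16", "int8", "int4"):
    print(f"{precision}: {weight_memory_gb(total_params, precision):.1f} GB of weights")

moe_flops = forward_flops_per_token(active_params)
dense_flops = forward_flops_per_token(total_params)
print(f"MoE compute per token is {moe_flops / dense_flops:.0%} of a dense 47B model")
```

Memory scales with the first number and compute with the second, so quantising the weights shrinks the footprint without changing which parameters a token activates.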

Where teams get confused

  1. Comparing MoE total parameters to dense model sizes. “Our 47B model beats GPT-3.5” conflates total and active parameters. The meaningful comparison for speed and cost is active parameters.

  2. Assuming MoE always means faster or cheaper. MoE models have higher memory requirements (all experts must be loaded) and can have higher per-token latency than a dense model with the same active parameter count if the router or expert switching overhead is high.

  3. Treating all MoE implementations as equivalent. Router architecture, expert granularity, load balancing, and capacity factor all affect real-world performance. Mixtral 8x7B, DeepSeek MoE, and Qwen MoE all make different design choices.

  4. Assuming expert specialisation makes MoE models modular. You cannot selectively load or unload individual experts to save memory without retraining. The router expects all experts to be present during inference.
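
Load balancing and capacity factor, mentioned in point 3, are easier to see with a toy example. The sketch below follows the Switch Transformer style of fixed expert capacity with top-1 routing, where tokens sent to a full expert are dropped from that layer; some implementations avoid dropping altogether, so treat this as one design choice among several. The function names and the token assignments are hypothetical.

```python
import math
from collections import Counter


def expert_capacity(num_tokens, num_experts, capacity_factor=1.25, experts_per_token=1):
    """Maximum number of tokens one expert may process in a batch."""
    return math.ceil(capacity_factor * num_tokens * experts_per_token / num_experts)


def dispatch(assignments, capacity):
    """Assign tokens to experts in order; tokens routed to a full expert overflow.

    assignments: the expert index the router chose for each token (top-1).
    Returns (load per expert, indices of overflowed tokens).
    """
    load = Counter()
    dropped = []
    for token_idx, expert in enumerate(assignments):
        if load[expert] < capacity:
            load[expert] += 1
        else:
            dropped.append(token_idx)
    return load, dropped


# Hypothetical batch: 8 tokens, 4 experts, router favours expert 0.
assignments = [0, 0, 0, 1, 2, 2, 3, 0]

for factor in (1.0, 1.25):
    capacity = expert_capacity(len(assignments), num_experts=4, capacity_factor=factor)
    load, dropped = dispatch(assignments, capacity)
    print(f"capacity_factor={factor}: capacity={capacity}, "
          f"load={dict(load)}, dropped tokens={dropped}")
```

A higher capacity factor drops fewer tokens but reserves more compute and memory per expert, which is one reason two MoE models with the same parameter counts can behave differently in production.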

Practical decision check

  • Active parameters, not total, drive per-token inference cost.
  • Total parameters determine GPU memory requirements — MoE models are memory-hungry even if they are fast.
  • MoE is most useful when you need broad knowledge capacity within a strict inference latency budget.
  • MoE models are more complex to serve — ensure your runtime supports them (vLLM does; llama.cpp has partial support; Ollama is workable for small MoE models).
  • For small-scale or memory-constrained deployments, a dense model with matching active parameters may be simpler and cheaper.

Methodology and sources

Check date: 2026-05-25

What was checked: MoE architecture papers (Shazeer et al. 2017, Fedus et al. 2022), model cards for Mixtral 8x7B, DeepSeek-V2, Qwen2-MoE, and community inference benchmarks.

Assumptions and limits: MoE efficiency depends heavily on implementation. The ratio of active to total parameters varies between model families. Router quality and load balancing are active research areas.

Change Log

  • 2026-05-27: Added direct source URLs to all named providers and services; added Change Log section. Content unchanged.