
Researchers at the Allen Institute for AI and UC Berkeley introduced EMO, a mixture-of-experts (MoE) language model that preserves performance when large portions of its expert table are removed. By forcing routing decisions to align across entire documents, EMO encourages experts to specialize by content domain rather than by shallow syntactic cues; that modularity lets a model retain much of its capability while running with a reduced expert subset, which matters for memory-constrained inference and targeted domain deployments.
EMO was trained as a large MoE with 128 experts, eight active experts per token, and what the authors describe as about 1 billion active parameters and 14 billion total parameters. Training consumed roughly 1 trillion tokens from the OLMoE pretraining corpus. As a full model, EMO matches an identically trained standard MoE and, the authors report, outperforms the OLMoE baseline even though that baseline was trained on five times more data.
The core technical change is document-level routing. Instead of routing decisions being made independently for each token, EMO computes a shared pool of active experts for a document by averaging router preferences across its tokens. All tokens in the document then choose from that shared pool, which biases the model toward consistent, domain-level expert selection and leads to specialists for areas such as medicine or politics rather than experts keyed to local surface patterns.
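The routing step described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: it assumes that "router preferences" are per-token softmax probabilities, that the pool is the top `pool_size` experts by mean probability, and that each token then takes its top `k` experts from that pool. The function and parameter names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one token's router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def document_level_route(
    router_logits: Sequence[Sequence[float]],  # shape: [num_tokens][num_experts]
    pool_size: int,  # assumed: number of experts in the shared document pool
    k: int,          # active experts per token (8 in the model described above)
) -> Tuple[List[int], List[List[int]]]:
    """Return (shared expert pool, per-token expert choices)."""
    if k > pool_size:
        raise ValueError("k must not exceed pool_size")

    probs = [softmax(row) for row in router_logits]
    num_experts = len(probs[0])

    # Average each expert's probability over all tokens in the document.
    mean_pref = [
        sum(p[e] for p in probs) / len(probs) for e in range(num_experts)
    ]

    # Assumption: the pool is the top `pool_size` experts by mean preference.
    pool = sorted(range(num_experts), key=lambda e: mean_pref[e], reverse=True)
    pool = pool[:pool_size]

    # Each token picks its own top-k, but only among experts in the pool.
    assignments = [
        sorted(pool, key=lambda e: p[e], reverse=True)[:k] for p in probs
    ]
    return pool, assignments


# Usage: three tokens, four experts. The second token's favorite expert (1)
# is outside the document pool, so it falls back to a pooled expert.
logits = [
    [2.0, 1.0, 0.0, -1.0],
    [1.5, 2.0, 0.0, -1.0],
    [0.0, 0.0, 3.0, -1.0],
]
pool, assignments = document_level_route(logits, pool_size=2, k=1)
```

The point of the example is the constraint itself: a token may prefer an expert locally, yet it is still served by the experts the document as a whole prefers.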
Two training adjustments preserved that modularity. The team replaced local, per-batch load balancing with a global load-balancing computation across many documents, so that the objective of concentrating each document on a shared set of experts did not conflict with the objective of spreading load evenly across all experts. They also varied the size of the document-level expert pool randomly during training, teaching the model to operate correctly with different expert subgroup sizes at inference time.
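To see why the scope of load balancing matters, the sketch below uses a Switch-Transformer-style auxiliary loss (number of experts times the sum over experts of dispatch fraction times mean router probability). That specific loss is an assumption swapped in for illustration; the source does not say which balancing term the authors use. The uniform sampling of pool sizes in `sample_pool_size` is likewise assumed, and all names are hypothetical.

```python
import random
from typing import List, Sequence


def load_balance_loss(
    dispatch_counts: Sequence[float], mean_probs: Sequence[float]
) -> float:
    """Switch-style auxiliary loss: E * sum_e f_e * P_e (minimum 1.0 when uniform)."""
    total = sum(dispatch_counts)
    num_experts = len(dispatch_counts)
    return num_experts * sum(
        (c / total) * p for c, p in zip(dispatch_counts, mean_probs)
    )


def global_load_balance_loss(
    per_doc_counts: Sequence[Sequence[float]],  # [num_docs][num_experts]
    per_doc_probs: Sequence[Sequence[float]],   # [num_docs][num_experts]
    per_doc_tokens: Sequence[int],              # tokens in each document
) -> float:
    """Pool routing statistics over many documents, then apply the loss once."""
    num_experts = len(per_doc_counts[0])
    counts = [sum(doc[e] for doc in per_doc_counts) for e in range(num_experts)]
    total_tokens = sum(per_doc_tokens)
    # Token-weighted mean of each expert's router probability.
    probs = [
        sum(p[e] * t for p, t in zip(per_doc_probs, per_doc_tokens)) / total_tokens
        for e in range(num_experts)
    ]
    return load_balance_loss(counts, probs)


def sample_pool_size(num_experts: int, k: int, rng: random.Random) -> int:
    """Assumed scheme: draw a pool size uniformly from [k, num_experts]."""
    return rng.randint(k, num_experts)


# Two documents, two experts; each document routes entirely to one expert.
counts = [[10, 0], [0, 10]]
probs = [[1.0, 0.0], [0.0, 1.0]]
local = [load_balance_loss(c, p) for c, p in zip(counts, probs)]
pooled = global_load_balance_loss(counts, probs, [10, 10])
```

In this toy case each document taken alone looks maximally unbalanced, while the pooled statistics are perfectly uniform. A balancing term computed over many documents therefore does not penalize a document for concentrating on its own experts.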
In capability tests that remove experts at inference, EMO retains high accuracy: keeping 25 percent of experts (32 of 128) costs roughly one percentage point of absolute performance on average across multiple benchmarks, while keeping 12.5 percent (16 experts) yields about a three-point drop. By contrast, a standard MoE under the same ablation typically loses 10 to 15 percentage points and can sometimes fall below the accuracy of a dense model with the same number of active parameters.
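One way to picture this kind of ablation is sketched below. The source does not describe how the retained experts are chosen, so the selection rule here (keep the experts routed to most often on a sample of in-domain text) is an assumption, and both function names are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Set


def select_expert_subset(
    routed_experts: Sequence[Sequence[int]], keep: int
) -> Set[int]:
    """Keep the `keep` experts used most often on a calibration sample.

    `routed_experts` holds, for each token, the expert ids the full model chose.
    """
    usage = Counter(e for token in routed_experts for e in token)
    return {expert for expert, _ in usage.most_common(keep)}


def route_with_subset(
    logits: Sequence[float], kept: Set[int], k: int
) -> List[int]:
    """Top-k routing restricted to the experts that remain loaded."""
    return sorted(kept, key=lambda e: logits[e], reverse=True)[:k]


# Usage: four tokens routed by the full model, then reduced to two experts.
calibration = [[0, 1], [0, 2], [0, 1], [3, 1]]
kept = select_expert_subset(calibration, keep=2)
choice = route_with_subset([0.1, 0.5, 0.9, 0.2], kept, k=1)
```

Here the token's highest-scoring expert (2) has been removed, so the router falls back to the best expert that is still loaded. How well the model tolerates that fallback is what the reported accuracy figures measure.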
For builders and deployers, EMO’s modularity enables explicit control over which content areas a loaded subset covers and significantly reduces the memory footprint needed for inference. The technique is compatible with existing MoE architectures and does not require manual domain labeling at scale, relying instead on document boundaries during pretraining; the authors present their results as a path toward more sliceable, predictable experts that preserve capability while enabling leaner runtime deployments.