REAP the Experts: Why Pruning Prevails for One-Shot MoE compression



REAP the Experts: Why Pruning Prevails for One-Shot MoE compression, by Mike Lasby and 5 other authors.

Abstract: Sparsely-activated Mixture-of-Experts (SMoE) models offer efficient pre-training and low latency, but their large parameter counts create significant memory overhead, motivating research into expert compression. Contrary to recent findings favouring expert merging on discriminative benchmarks, we find that expert pruning is a superior strategy for generative tasks. We demonstrate that existing merging techniques introduce an irreducible error due to the loss of fine-grained routing control over experts. Leveraging this insight, we propose Router-weighted Expert Activation Pruning (REAP), a novel pruning criterion that considers both router gate-values and expert activation norms to minimize the reconstruction error bound. Across a diverse set of SMoE models ranging from 20B to 1T parameters, REAP consistently outperforms merging and other pruning methods on generative benchmarks, especially at 50% compression. Notably, our method achieves near-lossless compression on code generation tasks with Qwen3-Coder-480B and Kimi-K2, even after pruning 50% of experts.
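The abstract only states that REAP combines router gate-values with expert activation norms; the exact criterion is in the paper. As a rough, hedged sketch of the idea, one could score each expert by its average router-weighted activation norm over a calibration set and keep the top-scoring experts (all function names and the scoring formula below are illustrative assumptions, not the authors' implementation):

```python
def reap_style_scores(gates, norms):
    """Hypothetical REAP-style saliency scores.

    gates[t][e]: router gate-value for expert e on token t
    norms[t][e]: activation norm of expert e's output on token t
    Returns one score per expert: mean over tokens of gate * norm.
    (Illustrative formula only; see the paper for the real criterion.)
    """
    n_tokens = len(gates)
    n_experts = len(gates[0])
    return [
        sum(gates[t][e] * norms[t][e] for t in range(n_tokens)) / n_tokens
        for e in range(n_experts)
    ]

def experts_to_keep(scores, keep_frac=0.5):
    """Indices of the highest-scoring experts after pruning a fraction.

    keep_frac=0.5 mirrors the 50% compression setting in the abstract.
    """
    n_keep = max(1, int(len(scores) * keep_frac))
    ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
    return sorted(ranked[:n_keep])
```

For example, an expert that is rarely routed to (low gates) or that produces small outputs (low norms) receives a low score and is pruned first; an expert can only survive if the router both selects it and its output meaningfully contributes.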

Submission history

From: Mike Lasby
[v1] Wed, 15 Oct 2025 18:29:28 UTC (359 KB)
[v2] Tue, 10 Mar 2026 14:03:57 UTC (373 KB)


