[2603.07685] Scalable Training of Mixture-of-Experts Models with Megatron Core

[Submitted on 8 Mar 2026 (v1), last revised 10 Mar 2026 (this version, v2)]

Authors: Zijie Yan and 44 other authors

Abstract: Scaling Mixture-of-Experts (MoE) training introduces systems challenges absent in dense models. Because each token activates only a subset of experts, this sparsity allows total parameters to grow much faster than per-token computation, creating coupled constraints across memory, communication, and computation. Optimizing one dimension often shifts pressure to another, demanding co-design across the full system stack.

We address these challenges for MoE training through integrated optimizations spanning memory (fine-grained recomputation, offloading, etc.), communication (optimized dispatchers, overlapping, etc.), and computation (Grouped GEMM, fusions, CUDA Graphs, etc.). The framework also provides Parallel Folding for flexible multi-dimensional parallelism, low-precision training support for FP8 and NVFP4, and efficient long-context training. On NVIDIA GB300 and GB200, it achieves 1,233/1,048 TFLOPS/GPU for DeepSeek-V3-685B and 974/919 TFLOPS/GPU for Qwen3-235B. As a performant, scalable, and production-ready open-source solution, it has been used across academia and industry for training MoE models ranging from billions to trillions of parameters on clusters scaling up to thousands of GPUs.

This report explains how these techniques work, their trade-offs, and their interactions at the systems level, providing practical guidance for scaling MoE models with Megatron Core.
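
The abstract's sparsity claim (total parameters growing much faster than per-token computation) can be made concrete with a little arithmetic. The sketch below is an illustration only, not code from Megatron Core: the function name `moe_ffn_params`, the two-matrix expert MLP, the bias-free linear router, and every size used are assumptions chosen for the example.

```python
def moe_ffn_params(hidden, ffn_hidden, num_experts, top_k):
    """Return (total, active) FFN parameter counts for one MoE layer.

    Assumptions (hypothetical, for illustration only):
      - each expert is a two-matrix MLP: hidden -> ffn_hidden -> hidden, no biases
      - the router is a single linear map of shape hidden x num_experts
      - each token is routed to exactly top_k experts
    """
    per_expert = 2 * hidden * ffn_hidden   # up-projection + down-projection
    router = hidden * num_experts          # router weights are used by every token
    total = num_experts * per_expert + router
    active = top_k * per_expert + router   # parameters touched by a single token
    return total, active


if __name__ == "__main__":
    # Hypothetical sizes; they do not describe any model named in the abstract.
    for num_experts in (8, 64, 256):
        total, active = moe_ffn_params(
            hidden=4096, ffn_hidden=2048, num_experts=num_experts, top_k=8
        )
        print(f"experts={num_experts:4d}  total={total:,}  "
              f"active per token={active:,}  ratio={total / active:.1f}")
```

Under these assumptions, adding experts increases the stored parameter count roughly linearly while the per-token active count changes only through the small router term, which is why memory and communication pressure can rise even when per-token compute stays nearly flat.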

Submission history

From: Zijie Yan
[v1] Sun, 8 Mar 2026 15:42:43 UTC (17,013 KB)
[v2] Tue, 10 Mar 2026 06:23:58 UTC (19,458 KB)
