Latest Mixture of Experts Research Papers
The newest Mixture of Experts papers from across the field — arXiv, NeurIPS, CVPR, Nature, and more — refreshed daily and ranked by relevance. Distill AI tracks Mixture of Experts so you don’t have to: get the standout work delivered to your inbox every morning, with 2-sentence summaries and the option to chat with any paper.
Recent papers
- OncoTraj: a public benchmark for longitudinal resistance prediction in EGFR-mutant non-small-cell lung cancer on osimertinib
Abhijoy Sarkar, Aarchi Singh Thakur · arXiv · Jun 9, 2026
Resistance to first-line osimertinib in EGFR-mutant non-small-cell lung cancer (NSCLC) is the canonical example of predictable clonal evolution under therapeutic pressure, yet no public benchmark exists for training or evaluating computatio…
- Reversible Foundations: Training a 120B Sparse MoE through State-Preserving Scaling
Rohan Shravan · arXiv · Jun 5, 2026
This paper reports on training a hundred-billion-parameter sparse mixture of experts on a single eight-GPU node, end to end. LightningLM 0.1V is a recurrence-backbone language model family grown in four stages from a small dense seed, throu…
- GC-MoE: Genomics-Guided Cell-Type-Specific Mixture of Experts for Histology-Based Single-Cell Spatial Transcriptomics
Kaito Shiku, Ahtisham Fazeel Abbasi, Ryoma Bise, Yuichiro Iwashita et al. · arXiv · Jun 1, 2026
Histology-based single-cell spatial transcriptomics (ST) estimation aims to predict gene expression for individual cells from histopathological images and cell locations, reducing the need for costly single-cell ST measurements. Unlike exis…
- MobileMoE: Scaling On-Device Mixture of Experts
Yanbei Chen, Hanxian Huang, Ernie Chang, Jacob Szwejbka et al. · arXiv · May 26, 2026
Mixture-of-Experts (MoE) has become the de facto architecture for hundred-billion-parameter language models, yet its advantages at sub-billion scales for on-device deployment remain largely unexplored. To close this gap, we present MobileMo…
- FAME: Failure-Aware Mixture-of-Experts for Message-Level Log Anomaly Detection
Huanchi Wang, Zihang Huang, Yifang Tian, Kristina Dzeparoska et al. · arXiv · May 21, 2026
Production systems generate millions of log lines daily, yet most anomaly detectors operate at the session or window-level, flagging groups of lines rather than identifying the specific message responsible. This coarse granularity forces op…
- SURGE: Approximation-free Training Free Particle Filter for Diffusion Surrogate
Lifu Wei, Yinuo Ren, Naichen Shi, Yiping Lu · arXiv · May 18, 2026
Diffusion-based generative models increasingly rely on inference-time guidance, adding a drift term or reweighting mixture of experts, to improve sample quality on task-specific objectives. However, most existing techniques require repeated…
- Eradicating Negative Transfer in Multi-Physics Foundation Models via Sparse Mixture-of-Experts Routing
Ellwil Sharma, Arastu Sharma · arXiv · May 14, 2026
Scaling Scientific Machine Learning (SciML) toward universal foundation models is bottlenecked by negative transfer: the simultaneous co-training of disparate partial differential equation (PDE) regimes can induce gradient conflict, unstabl…
- Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts
Sagi Ahrac, Noya Hochwald, Mor Geva · arXiv · May 12, 2026
Sparse Mixture-of-Experts (SMoE) models enable scaling language models efficiently, but training them remains challenging, as routing can collapse onto few experts and auxiliary load-balancing losses can reduce specialization. Motivated by …
- DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
Chenyang Song, Weilin Zhao, Xu Han, Chaojun Xiao et al. · arXiv · May 11, 2026
While Mixture-of-Experts (MoE) scales model capacity without proportionally increasing computation, its massive total parameter footprint creates significant storage and memory-access bottlenecks, which hinder efficient end-side deployment …
- UniPool: A Globally Shared Expert Pool for Mixture-of-Experts
Minbin Huang, Han Shi, Chuanyang Zheng, Yimeng Wu et al. · arXiv · May 7, 2026
Modern Mixture-of-Experts (MoE) architectures allocate expert capacity through a rigid per-layer rule: each transformer layer owns a separate expert set. This convention couples depth scaling with linear expert-parameter growth and assumes …
- DMoE-LLM: A dual-branch mixture-of-experts framework with large language models for wind power forecasting
Xingyu Feng, Dekang Guo, Mingshun Ye · Expert Systems with Applica... · Apr 30, 2026
- Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models
Gongbo Zhang, Wen Wang, Ye Tian, Li Yuan · arXiv · Apr 29, 2026
Diffusion large language models (dLLMs) offer parallel decoding and bidirectional context, but state-of-the-art dLLMs require billions of parameters for competitive performance. While existing distillation methods for dLLMs reduce inference…
- FaaSMoE: A Serverless Framework for Multi-Tenant Mixture-of-Experts Serving
Minghe Wang, Trever Schirmer, Mohammadreza Malekabbasi, David Bermbach · arXiv · Apr 29, 2026
Mixture-of-Experts (MoE) models offer high capacity with efficient inference cost by activating a small subset of expert models per input. However, deploying MoE models requires all experts to reside in memory, creating a gap between the re…
- Sub-MoE: Efficient Mixture-of-Expert LLMs Compression via Subspace Expert Merging
Lujun Li, Qiyuan Zhu, Jiacheng Wang, Xiaoyu Qin et al. · Proceedings of the AAAI Con... · Mar 14, 2026
Mixture of Experts (MoE) LLMs face significant obstacles due to their massive parameter scale, which imposes memory, storage, and deployment challenges. Although recent expert merging methods aim to achieve greater efficiency by consolidati…
- Nemotron 3 Nano: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning
Nvidia, Aaron Blakeman, Aaron Grattafiori, Aarti Basant, Abhibha Gupta et al. · arXiv.org · Dec 23, 2025
We present Nemotron 3 Nano 30B-A3B, a Mixture-of-Experts hybrid Mamba-Transformer language model. Nemotron 3 Nano was pretrained on 25 trillion text tokens, including more than 3 trillion new unique tokens over Nemotron 2, followed by super…
- MegaScale-Infer: Efficient Mixture-of-Experts Model Serving with Disaggregated Expert Parallelism
Rui Zhu, Ziheng Jiang, Chao Jin, Peng Wu et al. · Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication · Aug 27, 2025
Mixture-of-Experts (MoE) showcases tremendous potential to scale large language models (LLMs) with enhanced performance and reduced computational complexity. However, its sparsely activated architecture shifts feed-forward networks (FFNs) f…
- Towards Greater Leverage: Scaling Laws for Efficient Mixture-of-Experts Language Models
Changxin Tian, Kunlong Chen, Jia Liu, Ziqi Liu et al. · arXiv.org · Jul 23, 2025
Mixture-of-Experts (MoE) has become a dominant architecture for scaling Large Language Models (LLMs) efficiently by decoupling total parameters from computational cost. However, this decoupling creates a critical challenge: predicting the m…
- Learning Robust Stereo Matching in the Wild with Selective Mixture-of-Experts
Yun Wang, Longguang Wang, Chenghao Zhang, Yongjian Zhang et al. · IEEE International Conference on Computer Vision · Jul 6, 2025
Recently, learning-based stereo matching networks have advanced significantly. However, they often lack robustness and struggle to achieve impressive cross-domain performance due to domain shifts and imbalanced disparity distributions among…
- I2MoE: Interpretable Multimodal Interaction-aware Mixture-of-Experts
J. Xin, Sukwon Yun, Jie Peng, Inyoung Choi et al. · International Conference on Machine Learning · May 25, 2025
Modality fusion is a cornerstone of multimodal learning, enabling information integration from diverse data sources. However, vanilla fusion methods are limited by (1) inability to account for heterogeneous interactions between modalities a…
- DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving
Zhenjie Yang, Yilin Chai, Xiaosong Jia, Qifeng Li et al. · arXiv.org · May 22, 2025
End-to-end autonomous driving (E2E-AD) demands effective processing of multi-view sensory data and robust handling of diverse and complex driving scenarios, particularly rare maneuvers such as aggressive turns. Recent success of Mixture-of-…
- MoEQuant: Enhancing Quantization for Mixture-of-Experts Large Language Models via Expert-Balanced Sampling and Affinity Guidance
Xing Hu, Zhixuan Chen, Dawei Yang, Zukang Xu et al. · International Conference on Machine Learning · May 2, 2025
Mixture-of-Experts (MoE) large language models (LLMs), which leverage dynamic routing and sparse activation to enhance efficiency and scalability, have achieved higher performance while reducing computational costs. However, these models fa…
- ARTEMIS: Autoregressive End-to-End Trajectory Planning With Mixture of Experts for Autonomous Driving
Renju Feng, Ning Xi, Duanfeng Chu, Rukang Wang et al. · IEEE Robotics and Automation Letters · Apr 28, 2025
This letter presents ARTEMIS, an end-to-end autonomous driving framework that combines autoregressive trajectory planning with Mixture-of-Experts (MoE). Traditional modular methods suffer from error propagation, while existing end-to-end mo…
- A Comprehensive Survey of Mixture-of-Experts: Algorithms, Theory, and Applications
Siyuan Mu, Sen-Fon Lin · arXiv.org · Mar 10, 2025
Artificial intelligence (AI) has achieved astonishing successes in many domains, especially with the recent breakthroughs in the development of foundational large models. These large models, leveraging their extensive training data, provide…
- CL-MoE: Enhancing Multimodal Large Language Model with Dual Momentum Mixture-of-Experts for Continual Visual Question Answering
Tianyu Huai, Jie Zhou, Xingjiao Wu, Qin Chen et al. · Computer Vision and Pattern Recognition · Mar 1, 2025
Multimodal large language models (MLLMs) have garnered widespread attention from researchers due to their remarkable understanding and generation capabilities in visual language tasks (e.g., visual question answering). However, the rapid pa…
- Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts
Shulai Zhang, Ningxin Zheng, Haibin Lin, Ziheng Jiang et al. · Conference on Machine Learning and Systems · Feb 27, 2025
Mixture-of-experts (MoE) has been extensively employed to scale large language models to trillion-plus parameters while maintaining a fixed computational cost. The development of large MoE models in the distributed scenario encounters the p…
- Drop-Upcycling: Training Sparse Mixture of Experts with Partial Re-initialization
Taishi Nakamura, Takuya Akiba, Kazuki Fujii, Yusuke Oda et al. · International Conference on Learning Representations · Feb 26, 2025
The Mixture of Experts (MoE) architecture reduces the training and inference cost significantly compared to a dense model of equivalent capacity. Upcycling is an approach that initializes and trains an MoE model using a pre-trained dense mo…
- Training Sparse Mixture Of Experts Text Embedding Models
Zach Nussbaum, Brandon Duderstadt · arXiv.org · Feb 11, 2025
Transformer-based text embedding models have improved their performance on benchmarks like MIRACL and BEIR by increasing their parameter counts. However, this scaling approach introduces significant deployment challenges, including increase…
- Hierarchical Time-Aware Mixture of Experts for Multi-Modal Sequential Recommendation
Shengzhe Zhang, Liyi Chen, Dazhong Shen, Chao Wang et al. · The Web Conference · Jan 24, 2025
Multi-modal sequential recommendation (SR) leverages multi-modal data to learn more comprehensive item features and user preferences than traditional SR methods, which has become a critical topic in both academia and industry. Existing meth…
- Modality Interactive Mixture-of-Experts for Fake News Detection
Yifan Liu, Yaokun Liu, Zelin Li, Ruichen Yao et al. · The Web Conference · Jan 21, 2025
The proliferation of fake news on social media platforms disproportionately impacts vulnerable populations, eroding trust, exacerbating inequality, and amplifying harmful narratives. Detecting fake news in multimodal contexts-where deceptiv…
- Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models
Samira Abnar, Harshay Shah, Dan Busbridge, Alaaeldin Mohamed Elnouby Ali et al. · International Conference on Machine Learning · Jan 21, 2025
Scaling the capacity of language models has consistently proven to be a reliable approach for improving performance and unlocking new capabilities. Capacity can be primarily defined by two dimensions: the number of model parameters and the …