woct0rdho: Optimized MoE training for large language models
Top 99.1% on SourcePulse
This repository addresses a significant performance bottleneck in Hugging Face Transformers' Mixture-of-Experts (MoE) models, particularly Qwen3 MoE, which suffer from slow training due to inefficient expert routing. It provides fused MoE kernels implemented in Triton, enabling users to fine-tune large MoE models on a single GPU with as little as 16GB of VRAM at substantially higher throughput. This makes it especially valuable for researchers and practitioners working in resource-constrained environments.
How It Works
The core innovation lies in the moe_fused_linear function, which replaces the slow for-loop expert access with optimized Triton kernels. These kernels leverage persistent workers and extensive autotuning for grouped GEMM operations, improving memory coalescence by sorting inputs by experts. This approach is designed to be significantly faster than fallback methods and offers advantages over other Triton implementations, especially on older Nvidia GPUs.
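As a rough picture of what the fused kernel computes, the sketch below contrasts the slow per-expert for-loop with an equivalent sort-by-expert grouping in plain PyTorch. The tensor layout, argument names, and helper functions are illustrative assumptions for exposition, not the repository's actual moe_fused_linear signature, and the real speedup comes from the autotuned Triton grouped-GEMM kernels rather than this Python-level reordering.

```python
import torch

def moe_linear_looped(x, weight, expert_index):
    """Reference semantics of a routed MoE linear layer (illustrative).

    x:            (num_tokens, in_features)   activations already routed to experts
    weight:       (num_experts, out_features, in_features)   per-expert weights
    expert_index: (num_tokens,)   expert assigned to each token

    This is the slow pattern being replaced: one small GEMM per expert,
    with scattered gathers/scatters for that expert's tokens.
    """
    out = x.new_empty(x.shape[0], weight.shape[1])
    for e in range(weight.shape[0]):           # slow per-expert loop
        mask = expert_index == e
        if mask.any():
            out[mask] = x[mask] @ weight[e].T
    return out

def moe_linear_sorted(x, weight, expert_index):
    """Same result, but tokens are first sorted by expert so each expert's rows
    are contiguous in memory -- the coalescing idea behind the grouped GEMM."""
    order = torch.argsort(expert_index)
    inverse = torch.empty_like(order)
    inverse[order] = torch.arange(order.numel(), device=order.device)

    x_sorted = x[order]
    counts = torch.bincount(expert_index, minlength=weight.shape[0])
    out_sorted = x.new_empty(x.shape[0], weight.shape[1])

    start = 0
    for e, n in enumerate(counts.tolist()):    # contiguous slice per expert
        if n:
            out_sorted[start:start + n] = x_sorted[start:start + n] @ weight[e].T
        start += n
    return out_sorted[inverse]                 # restore original token order

# Quick check that both paths agree (illustrative shapes):
x = torch.randn(64, 128)
w = torch.randn(8, 256, 128)
idx = torch.randint(0, 8, (64,))
torch.testing.assert_close(moe_linear_looped(x, w, idx), moe_linear_sorted(x, w, idx))
```

In the repository, this grouped computation happens inside the autotuned Triton kernels with persistent workers, so the per-expert Python loop disappears entirely.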
Quick Start & Requirements
Example training scripts (example_train_30b_a3b_unsloth.py and example_train_30b_a3b_gguf.py) are provided.
Highlighted Details
Maintenance & Community
Code is actively being upstreamed into core libraries such as Hugging Face Transformers, PEFT, and Unsloth. No specific community channels (e.g., Discord, Slack) or direct contributor information are detailed in the README.
Licensing & Compatibility
Limitations & Caveats