transformers-qwen3-moe-fused by woct0rdho

Optimized MoE training for large language models

Created 10 months ago · 254 stars · Top 99.1% on SourcePulse · View on GitHub

Project Summary

This repository addresses the significant performance bottleneck in Hugging Face Transformers' Mixture-of-Experts (MoE) models, particularly Qwen3 MoE, which suffer from slow training due to inefficient expert routing. It provides fused MoE kernels implemented in Triton, enabling users to fine-tune large MoE models on single GPUs with as little as 16GB VRAM, achieving substantially higher throughput. This is crucial for researchers and practitioners working with resource-constrained environments.

How It Works

The core innovation is the moe_fused_linear function, which replaces the slow per-expert for-loop with optimized Triton kernels. These grouped-GEMM kernels use a persistent-kernel design and extensive autotuning, and they improve memory coalescing by sorting inputs by expert. The approach is designed to be significantly faster than the for-loop fallback and offers advantages over other Triton implementations, especially on older Nvidia GPUs.
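
To make the idea concrete, here is a minimal PyTorch sketch contrasting the per-expert for-loop with a sort-by-expert grouped variant. It illustrates the technique only, not the project's Triton kernel; the function names, the top-1 routing assumption, and the tensor layout are assumptions made for the example.

```python
import torch

# Assumed shapes:
#   x:            (num_tokens, d_in)   activations, one chosen expert per token
#   expert_index: (num_tokens,)        chosen expert id per token
#   weight:       (num_experts, d_out, d_in)

def moe_linear_naive(x, expert_index, weight):
    # Roughly the per-expert for-loop pattern that makes the stock
    # implementation slow: gather/scatter each expert's tokens separately.
    out = x.new_empty(x.shape[0], weight.shape[1])
    for e in range(weight.shape[0]):
        mask = expert_index == e
        out[mask] = x[mask] @ weight[e].T
    return out

def moe_linear_sorted(x, expert_index, weight):
    # Sort tokens by expert so each expert's tokens are contiguous in memory,
    # which is what makes the grouped GEMM's memory accesses coalesced.
    order = torch.argsort(expert_index)
    x_sorted = x[order]
    counts = torch.bincount(expert_index, minlength=weight.shape[0])
    out_sorted = x_sorted.new_empty(x.shape[0], weight.shape[1])
    start = 0
    for e, n in enumerate(counts.tolist()):
        if n:
            out_sorted[start:start + n] = x_sorted[start:start + n] @ weight[e].T
        start += n
    # Scatter results back to the original token order.
    out = torch.empty_like(out_sorted)
    out[order] = out_sorted
    return out
```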

Quick Start & Requirements

  • Installation: Integrates with existing Hugging Face Transformers, PEFT, and Unsloth workflows. Example training scripts (example_train_30b_a3b_unsloth.py, example_train_30b_a3b_gguf.py) are provided; a generic setup sketch follows this list.
  • Prerequisites: Nvidia GPU (optimized for RTX 3090/4090); PyTorch >= 2.1.0 recommended. Compatible with bitsandbytes (bnb 4-bit), GGUF, and Unsloth.
  • Links: No direct quick-start or demo links provided, but example scripts serve as usage guides.
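
For orientation, the sketch below shows a generic Transformers + PEFT QLoRA setup for a Qwen3 MoE checkpoint. It deliberately omits this project's fused-kernel patching step (the example scripts are the authoritative entry points); the checkpoint id, quantization settings, and LoRA hyperparameters are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Assumed checkpoint id; consult the repository's example scripts for the
# exact model and the fused-kernel setup they apply.
model_id = "Qwen/Qwen3-30B-A3B"

# bnb 4-bit loading, as mentioned in the summary above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Illustrative LoRA configuration; target modules and ranks are assumptions.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```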

Highlighted Details

  • LoRA Integration: Full support for LoRA and QLoRA, including seamless conversion between fused and unfused LoRA formats.
  • GGUF Training: Enables LoRA training over quantized GGUF models, potentially using sub-4-bit quantization for VRAM savings.
  • Fused Kernels: Includes optimized Triton kernels for fused softmax-topk routing and expert indexing; a plain-PyTorch reference for the routing step appears after this list.
  • Performance Claims: Significantly faster than standard Transformers MoE implementations and competitive with or superior to other Triton-based solutions on specific hardware.
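
As context for the fused softmax-topk kernel, here is the unfused routing step in plain PyTorch. The renormalization of the selected weights follows the common Qwen-style MoE router; treat the details as an illustration of what the kernel fuses, not as the project's exact kernel semantics.

```python
import torch
import torch.nn.functional as F

def softmax_topk(router_logits: torch.Tensor, top_k: int):
    """router_logits: (num_tokens, num_experts) -> per-token expert weights and ids."""
    # Softmax over experts in float32 for numerical stability.
    probs = F.softmax(router_logits, dim=-1, dtype=torch.float32)
    # Keep the top-k experts per token.
    topk_weights, topk_ids = torch.topk(probs, top_k, dim=-1)
    # Renormalize so the selected experts' weights sum to 1 per token
    # (assumed behavior, matching the usual Qwen-style router).
    topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)
    return topk_weights, topk_ids
```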

Maintenance & Community

Code from this project is actively being upstreamed into core libraries such as Hugging Face Transformers, PEFT, and Unsloth. No specific community channels (e.g., Discord, Slack) or direct contributor information is detailed in the README.

Licensing & Compatibility

  • License: Apache-2.0, which permits commercial use.
  • Compatibility: Designed for seamless integration with the Hugging Face ecosystem (Transformers, PEFT, bitsandbytes) and compatible with Unsloth and GGUF formats.

Limitations & Caveats

  • Multi-GPU: Multi-GPU support is not a primary focus; expert parallelism is out of scope, though DDP via Accelerate is mentioned as a possibility.
  • Hardware Optimization: Primarily optimized for RTX 3090/4090; further optimization is needed for newer architectures like RTX 5090.
  • Performance Trade-offs: Fusing 4-bit dequantization with MoE linear layers currently results in slower performance for large batch sizes compared to unfused versions.
Health Check

  • Last Commit: 2 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 9 stars in the last 30 days

Starred by Tri Dao (Chief Scientist at Together AI), Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), and 1 more.

Explore Similar Projects

oslo by tunib-ai · 0% · 309 stars
Framework for large-scale transformer optimization. Created 4 years ago, updated 3 years ago.
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI).

rtp-llm by alibaba · 0.6% · 1k stars
LLM inference engine for diverse applications. Created 2 years ago, updated 5 hours ago.
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Sebastian Raschka (Author of "Build a Large Language Model (From Scratch)"), and 2 more.

SimpleTuner by bghira · 0.2% · 3k stars
Fine-tuning kit for diffusion models. Created 2 years ago, updated 20 hours ago.
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Jiaming Song (Chief Scientist at Luma AI), and 23 more.

Megatron-LM by NVIDIA · 0.3% · 16k stars
Framework for training transformer models at scale. Created 7 years ago, updated 6 hours ago.