transformers-qwen3-moe-fused by woct0rdho

Optimized MoE training for large language models

Created 10 months ago · 254 stars · Top 99.1% on SourcePulse · View on GitHub

Project Summary

This repository addresses the significant performance bottleneck in Hugging Face Transformers' Mixture-of-Experts (MoE) models, particularly Qwen3 MoE, which suffer from slow training due to inefficient expert routing. It provides fused MoE kernels implemented in Triton, enabling users to fine-tune large MoE models on single GPUs with as little as 16GB VRAM, achieving substantially higher throughput. This is crucial for researchers and practitioners working with resource-constrained environments.

How It Works

The core innovation is the moe_fused_linear function, which replaces the slow per-expert for-loop with optimized Triton kernels. These grouped-GEMM kernels use a persistent-kernel design and extensive autotuning, and they improve memory coalescing by sorting inputs by expert. The approach is designed to be significantly faster than the for-loop fallback and offers advantages over other Triton implementations, especially on older Nvidia GPUs.
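
To make the idea concrete, here is a minimal PyTorch sketch contrasting the per-expert for-loop with a sort-by-expert grouped variant. It illustrates the technique only, not the project's Triton kernel; the function names, the top-1 routing assumption, and the tensor layout are assumptions made for the example.

```python
import torch

# Assumed shapes:
#   x:            (num_tokens, d_in)   activations, one chosen expert per token
#   expert_index: (num_tokens,)        chosen expert id per token
#   weight:       (num_experts, d_out, d_in)

def moe_linear_naive(x, expert_index, weight):
    # Roughly the per-expert for-loop pattern that makes the stock
    # implementation slow: gather/scatter each expert's tokens separately.
    out = x.new_empty(x.shape[0], weight.shape[1])
    for e in range(weight.shape[0]):
        mask = expert_index == e
        out[mask] = x[mask] @ weight[e].T
    return out

def moe_linear_sorted(x, expert_index, weight):
    # Sort tokens by expert so each expert's tokens are contiguous in memory,
    # which is what makes the grouped GEMM's memory accesses coalesced.
    order = torch.argsort(expert_index)
    x_sorted = x[order]
    counts = torch.bincount(expert_index, minlength=weight.shape[0])
    out_sorted = x_sorted.new_empty(x.shape[0], weight.shape[1])
    start = 0
    for e, n in enumerate(counts.tolist()):
        if n:
            out_sorted[start:start + n] = x_sorted[start:start + n] @ weight[e].T
        start += n
    # Scatter results back to the original token order.
    out = torch.empty_like(out_sorted)
    out[order] = out_sorted
    return out
```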

Quick Start & Requirements

  • Installation: Integrates with existing Hugging Face Transformers, PEFT, and Unsloth workflows. Example training scripts (example_train_30b_a3b_unsloth.py, example_train_30b_a3b_gguf.py) are provided; a generic setup sketch follows this list.
  • Prerequisites: Nvidia GPU (optimized for RTX 3090/4090); PyTorch >= 2.1.0 recommended. Compatible with bitsandbytes (bnb 4-bit), GGUF, and Unsloth.
  • Links: No direct quick-start or demo links provided, but example scripts serve as usage guides.
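
For orientation, the sketch below shows a generic Transformers + PEFT QLoRA setup for a Qwen3 MoE checkpoint. It deliberately omits this project's fused-kernel patching step (the example scripts are the authoritative entry points); the checkpoint id, quantization settings, and LoRA hyperparameters are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Assumed checkpoint id; consult the repository's example scripts for the
# exact model and the fused-kernel setup they apply.
model_id = "Qwen/Qwen3-30B-A3B"

# bnb 4-bit loading, as mentioned in the summary above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Illustrative LoRA configuration; target modules and ranks are assumptions.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```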

Highlighted Details

  • LoRA Integration: Full support for LoRA and QLoRA, including seamless conversion between fused and unfused LoRA formats.
  • GGUF Training: Enables LoRA training over quantized GGUF models, potentially using sub-4-bit quantization for VRAM savings.
  • Fused Kernels: Includes optimized Triton kernels for fused softmax-topk routing and expert indexing; a plain-PyTorch reference for the routing step appears after this list.
  • Performance Claims: Significantly faster than standard Transformers MoE implementations and competitive with or superior to other Triton-based solutions on specific hardware.
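
As context for the fused softmax-topk kernel, here is the unfused routing step in plain PyTorch. The renormalization of the selected weights follows the common Qwen-style MoE router; treat the details as an illustration of what the kernel fuses, not as the project's exact kernel semantics.

```python
import torch
import torch.nn.functional as F

def softmax_topk(router_logits: torch.Tensor, top_k: int):
    """router_logits: (num_tokens, num_experts) -> per-token expert weights and ids."""
    # Softmax over experts in float32 for numerical stability.
    probs = F.softmax(router_logits, dim=-1, dtype=torch.float32)
    # Keep the top-k experts per token.
    topk_weights, topk_ids = torch.topk(probs, top_k, dim=-1)
    # Renormalize so the selected experts' weights sum to 1 per token
    # (assumed behavior, matching the usual Qwen-style router).
    topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)
    return topk_weights, topk_ids
```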

Maintenance & Community

Code from this project is actively being upstreamed into core libraries such as Hugging Face Transformers, PEFT, and Unsloth. No specific community channels (e.g., Discord, Slack) or direct contributor information is detailed in the README.

Licensing & Compatibility

  • License: Apache-2.0, which permits commercial use.
  • Compatibility: Designed for seamless integration with the Hugging Face ecosystem (Transformers, PEFT, bitsandbytes) and compatible with Unsloth and GGUF formats.

Limitations & Caveats

  • Multi-GPU: Multi-GPU support is not a primary focus; expert parallelism is out of scope, though DDP via Accelerate is mentioned as a possibility.
  • Hardware Optimization: Primarily optimized for RTX 3090/4090; further optimization is needed for newer architectures like RTX 5090.
  • Performance Trade-offs: Fusing 4-bit dequantization with MoE linear layers currently results in slower performance for large batch sizes compared to unfused versions.
Health Check

  • Last Commit: 2 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 9 stars in the last 30 days

Starred by Tri Dao (Chief Scientist at Together AI), Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), and 1 more.

Explore Similar Projects

oslo by tunib-ai · 0% · 309 stars
Framework for large-scale transformer optimization. Created 4 years ago, updated 3 years ago.
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI).

rtp-llm by alibaba · 0.6% · 1k stars
LLM inference engine for diverse applications. Created 2 years ago, updated 5 hours ago.
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Sebastian Raschka (Author of "Build a Large Language Model (From Scratch)"), and 2 more.

SimpleTuner by bghira · 0.2% · 3k stars
Fine-tuning kit for diffusion models. Created 2 years ago, updated 20 hours ago.
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Jiaming Song (Chief Scientist at Luma AI), and 23 more.

Megatron-LM by NVIDIA · 0.3% · 16k stars
Framework for training transformer models at scale. Created 7 years ago, updated 6 hours ago.