sonic-moe by Dao-AILab

Accelerating Mixture-of-Experts (MoE) models

Created 4 months ago
648 stars

Top 51.2% on SourcePulse

Project Summary

SonicMoE provides a high-performance implementation of Mixture-of-Experts (MoE) layers, specifically optimized for NVIDIA Hopper and Blackwell architecture GPUs. It addresses the computational bottlenecks in MoE models by employing IO-aware optimizations and leveraging CuTeDSL and Triton, aiming to deliver state-of-the-art training throughput and reduced activation memory usage. This project is targeted at researchers and engineers working with large-scale deep learning models who require efficient MoE implementations on modern NVIDIA hardware.

How It Works

SonicMoE accelerates MoE layers through a combination of IO-aware optimizations and tile-aware kernel designs, primarily implemented using CuTeDSL and Triton. The core approach builds upon the Grouped GEMM kernels from the QuACK library, which is itself based on CUTLASS. This design maximizes GPU utilization by efficiently managing memory access patterns and computation tiling, which is particularly beneficial for the memory-intensive operations characteristic of MoE layers on these GPUs.
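To make the Grouped GEMM idea concrete, here is a minimal pure-Python sketch of the operation: each expert multiplies its own, variably sized batch of routed tokens by its own weight matrix. Fused kernels (as in CUTLASS/QuACK) perform all of these multiplications in a single launch; the function names below are illustrative only and are not sonic-moe's API.

```python
def matmul(a, b):
    """Naive matrix multiply, for small illustrative inputs only."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]

def grouped_gemm(token_groups, expert_weights):
    """One matmul per expert group; a fused kernel does all of these at once."""
    return [matmul(tokens, w) for tokens, w in zip(token_groups, expert_weights)]

# Two experts with different numbers of routed tokens (2 and 1), hidden size 2.
groups = [
    [[1.0, 2.0], [3.0, 4.0]],  # tokens routed to expert 0
    [[5.0, 6.0]],              # tokens routed to expert 1
]
weights = [
    [[1.0, 0.0], [0.0, 1.0]],  # expert 0: identity
    [[2.0, 0.0], [0.0, 2.0]],  # expert 1: scales by 2
]
outputs = grouped_gemm(groups, weights)
# outputs[0] == [[1.0, 2.0], [3.0, 4.0]]; outputs[1] == [[10.0, 12.0]]
```

The key property is that group sizes vary per expert at every step, which is why a naive per-expert loop of matmuls underutilizes the GPU and why tile-aware grouped kernels matter.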

Quick Start & Requirements

  • Installation: Install from PyPI with pip install sonic-moe. Alternatively, clone the repository and install from source with pip install -r requirements.txt followed by pip install -e .
  • Prerequisites: NVIDIA Hopper (H100, H200) or Blackwell (GB200, B200, B300) GPUs are required. CUDA 12.9+ (13.0+ for B300) and Python 3.12+ are recommended. PyTorch 2.7+ is required (2.9.1 recommended). Users with B300 GPUs must manually upgrade Triton to 3.6.0 after installing PyTorch.
  • Links: GitHub Repository
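The installation steps above, as a runnable sequence (the repository URL is assumed from the project name and owner shown on this page):

```shell
# Option 1: install the released package from PyPI
pip install sonic-moe

# Option 2: install from source (repository URL assumed: Dao-AILab/sonic-moe)
git clone https://github.com/Dao-AILab/sonic-moe.git
cd sonic-moe
pip install -r requirements.txt
pip install -e .
```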

Highlighted Details

  • Optimized for NVIDIA Hopper and Blackwell GPU architectures.
  • Leverages CuTeDSL and Triton for IO-aware and tile-aware kernel optimizations.
  • Supports various routing strategies (e.g., TC top-K, Qwen3-style) and weight layout formats (interleaved, concatenated).
  • Built upon Grouped GEMM kernels from the QuACK library.
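The top-K routing mentioned above can be sketched in a few lines: softmax the per-expert gate logits, keep the K largest, and renormalize their weights. This is a generic illustration of the technique, not sonic-moe's routing implementation or API.

```python
import math

def topk_route(gate_logits, k):
    """Return (expert_indices, weights) for the k highest-scoring experts."""
    # Numerically stable softmax over gate logits.
    m = max(gate_logits)
    exps = [math.exp(x - m) for x in gate_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Indices of the k largest probabilities, highest first.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Renormalize so the selected experts' weights sum to 1.
    norm = sum(probs[i] for i in top)
    return top, [probs[i] / norm for i in top]

# Route one token across 4 experts, keeping the top 2.
experts, weights = topk_route([2.0, 0.5, 1.0, -1.0], k=2)
# experts == [0, 2]; weights sum to 1 with weights[0] > weights[1]
```

Production routers add load-balancing losses and capacity limits on top of this core selection step; those details are out of scope for this sketch.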

Maintenance & Community

The project welcomes contributions through issues, feature requests, and pull requests. Specific community channels like Discord or Slack are not mentioned in the README.

Licensing & Compatibility

This project is licensed under the Apache License 2.0, which is permissive and generally compatible with commercial use and closed-source linking.

Limitations & Caveats

The implementation is specifically tailored for NVIDIA Hopper and Blackwell architectures, requiring recent CUDA versions (12.9+). Compatibility with older GPU architectures or CUDA versions is not guaranteed.

Health Check
Last Commit

15 hours ago

Responsiveness

Inactive

Pull Requests (30d)
5
Issues (30d)
4
Star History
40 stars in the last 30 days

Explore Similar Projects

Starred by Chris Lattner (Author of LLVM, Clang, Swift, Mojo, MLIR; Cofounder of Modular), Vincent Weisser (Cofounder of Prime Intellect), and 18 more.

open-infra-index by deepseek-ai

0.1%
8k
AI infrastructure tools for efficient AGI development
Created 1 year ago
Updated 11 months ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Ying Sheng (Coauthor of SGLang).

fastllm by ztxz16

0.1%
4k
High-performance C++ LLM inference library
Created 2 years ago
Updated 4 hours ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Eric Zhang (Founding Engineer at Modal), and 9 more.

DeepGEMM by deepseek-ai

9.0%
7k
CUDA library for efficient FP8 GEMM kernels with fine-grained scaling
Created 1 year ago
Updated 1 day ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Jeff Hammerbacher (Cofounder of Cloudera), and 9 more.

FlashMLA by deepseek-ai

0.1%
13k
Efficient CUDA kernels for MLA decoding
Created 1 year ago
Updated 2 weeks ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Jiayi Pan (Author of SWE-Gym; MTS at xAI), and 34 more.

flash-attention by Dao-AILab

0.4%
23k
Fast, memory-efficient attention implementation
Created 3 years ago
Updated 1 day ago