sonic-moe by Dao-AILab

Accelerating Mixture-of-Experts (MoE) models

Created 4 months ago
648 stars

Top 51.2% on SourcePulse

Project Summary

SonicMoE provides a high-performance implementation of Mixture-of-Experts (MoE) layers, specifically optimized for NVIDIA Hopper and Blackwell architecture GPUs. It addresses the computational bottlenecks in MoE models by employing IO-aware optimizations and leveraging CuTeDSL and Triton, aiming to deliver state-of-the-art training throughput and reduced activation memory usage. This project is targeted at researchers and engineers working with large-scale deep learning models who require efficient MoE implementations on modern NVIDIA hardware.

How It Works

SonicMoE accelerates MoE layers through a combination of IO-aware optimizations and tile-aware kernel designs, primarily implemented using CuTeDSL and Triton. The core approach builds upon the Grouped GEMM kernels from the QuACK library, which is itself based on CUTLASS. This design maximizes GPU utilization by efficiently managing memory access patterns and computation tiling, which is particularly beneficial for the memory-intensive operations characteristic of MoE layers on these GPUs.
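To make the Grouped GEMM idea concrete, here is a minimal pure-Python sketch of the operation: each expert multiplies its own, variably sized batch of routed tokens by its own weight matrix. Fused kernels (as in CUTLASS/QuACK) perform all of these multiplications in a single launch; the function names below are illustrative only and are not sonic-moe's API.

```python
def matmul(a, b):
    """Naive matrix multiply, for small illustrative inputs only."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]

def grouped_gemm(token_groups, expert_weights):
    """One matmul per expert group; a fused kernel does all of these at once."""
    return [matmul(tokens, w) for tokens, w in zip(token_groups, expert_weights)]

# Two experts with different numbers of routed tokens (2 and 1), hidden size 2.
groups = [
    [[1.0, 2.0], [3.0, 4.0]],  # tokens routed to expert 0
    [[5.0, 6.0]],              # tokens routed to expert 1
]
weights = [
    [[1.0, 0.0], [0.0, 1.0]],  # expert 0: identity
    [[2.0, 0.0], [0.0, 2.0]],  # expert 1: scales by 2
]
outputs = grouped_gemm(groups, weights)
# outputs[0] == [[1.0, 2.0], [3.0, 4.0]]; outputs[1] == [[10.0, 12.0]]
```

The key property is that group sizes vary per expert at every step, which is why a naive per-expert loop of matmuls underutilizes the GPU and why tile-aware grouped kernels matter.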

Quick Start & Requirements

  • Installation: Install from PyPI with pip install sonic-moe. Alternatively, clone the repository and install from source with pip install -r requirements.txt followed by pip install -e .
  • Prerequisites: NVIDIA Hopper (H100, H200) or Blackwell (GB200, B200, B300) GPUs are required. CUDA 12.9+ (13.0+ for B300) and Python 3.12+ are recommended. PyTorch 2.7+ is required (2.9.1 recommended). Users with B300 GPUs must manually upgrade Triton to 3.6.0 after installing PyTorch.
  • Links: GitHub Repository
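The installation steps above, as a runnable sequence (the repository URL is assumed from the project name and owner shown on this page):

```shell
# Option 1: install the released package from PyPI
pip install sonic-moe

# Option 2: install from source (repository URL assumed: Dao-AILab/sonic-moe)
git clone https://github.com/Dao-AILab/sonic-moe.git
cd sonic-moe
pip install -r requirements.txt
pip install -e .
```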

Highlighted Details

  • Optimized for NVIDIA Hopper and Blackwell GPU architectures.
  • Leverages CuTeDSL and Triton for IO-aware and tile-aware kernel optimizations.
  • Supports various routing strategies (e.g., TC top-K, Qwen3-style) and weight layout formats (interleaved, concatenated).
  • Built upon Grouped GEMM kernels from the QuACK library.
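The top-K routing mentioned above can be sketched in a few lines: softmax the per-expert gate logits, keep the K largest, and renormalize their weights. This is a generic illustration of the technique, not sonic-moe's routing implementation or API.

```python
import math

def topk_route(gate_logits, k):
    """Return (expert_indices, weights) for the k highest-scoring experts."""
    # Numerically stable softmax over gate logits.
    m = max(gate_logits)
    exps = [math.exp(x - m) for x in gate_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Indices of the k largest probabilities, highest first.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Renormalize so the selected experts' weights sum to 1.
    norm = sum(probs[i] for i in top)
    return top, [probs[i] / norm for i in top]

# Route one token across 4 experts, keeping the top 2.
experts, weights = topk_route([2.0, 0.5, 1.0, -1.0], k=2)
# experts == [0, 2]; weights sum to 1 with weights[0] > weights[1]
```

Production routers add load-balancing losses and capacity limits on top of this core selection step; those details are out of scope for this sketch.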

Maintenance & Community

The project welcomes contributions through issues, feature requests, and pull requests. Specific community channels like Discord or Slack are not mentioned in the README.

Licensing & Compatibility

This project is licensed under the Apache License 2.0, which is permissive and generally compatible with commercial use and closed-source linking.

Limitations & Caveats

The implementation is specifically tailored for NVIDIA Hopper and Blackwell architectures, requiring recent CUDA versions (12.9+). Compatibility with older GPU architectures or CUDA versions is not guaranteed.

Health Check
Last Commit

15 hours ago

Responsiveness

Inactive

Pull Requests (30d)
5
Issues (30d)
4
Star History
40 stars in the last 30 days

Explore Similar Projects

Starred by Chris Lattner (Author of LLVM, Clang, Swift, Mojo, MLIR; Cofounder of Modular), Vincent Weisser (Cofounder of Prime Intellect), and 18 more.

open-infra-index by deepseek-ai

0.1%
8k
AI infrastructure tools for efficient AGI development
Created 1 year ago
Updated 11 months ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Ying Sheng (Coauthor of SGLang).

fastllm by ztxz16

0.1%
4k
High-performance C++ LLM inference library
Created 2 years ago
Updated 4 hours ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Eric Zhang (Founding Engineer at Modal), and 9 more.

DeepGEMM by deepseek-ai

9.0%
7k
CUDA library for efficient FP8 GEMM kernels with fine-grained scaling
Created 1 year ago
Updated 1 day ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Jeff Hammerbacher (Cofounder of Cloudera), and 9 more.

FlashMLA by deepseek-ai

0.1%
13k
Efficient CUDA kernels for MLA decoding
Created 1 year ago
Updated 2 weeks ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Jiayi Pan (Author of SWE-Gym; MTS at xAI), and 34 more.

flash-attention by Dao-AILab

0.4%
23k
Fast, memory-efficient attention implementation
Created 3 years ago
Updated 1 day ago