FlashMoE  by osayamenja

Fast distributed Mixture-of-Experts inference

Created 2 years ago
260 stars

Top 97.4% on SourcePulse

Project Summary

FlashMoE targets the main inference bottlenecks of Distributed Mixture-of-Experts (DMoE) models: low tensor core utilization and high communication overhead. It delivers a fully fused, single-kernel system for high-throughput, low-latency DMoE inference. By eliminating kernel boundaries and overlapping communication with computation at fine granularity, FlashMoE reports substantial performance gains over existing solutions.

How It Works

FlashMoE's core innovation is complete kernel fusion: MoE dispatch, expert computation, and MoE combine all run inside a single, tile-pipelined persistent kernel. The kernel embeds an "Operating System within the kernel" that schedules tasks concurrently, which hides system and communication latency. Fine-grained overlap of communication with computation, combined with task locality, reduces GPU stalls and raises tensor core utilization compared with traditional multi-kernel DMoE implementations.
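
To make the three stages concrete, here is a minimal pure-Python reference of what dispatch, expert computation, and combine compute for a top-k routed MoE layer. It is a sketch of the math only, not FlashMoE's API or kernel: the names softmax, top_k, and moe_forward are hypothetical, experts are plain Python callables, and the renormalized top-k gating is an assumed (common) routing scheme. In a multi-kernel system each stage would typically be a separate launch with communication in between; FlashMoE fuses them into one kernel.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(probs, k):
    """Return (expert_index, renormalized_weight) for the k largest probs."""
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def moe_forward(tokens, gate_logits, experts, k=2):
    """Reference MoE layer: dispatch -> expert compute -> combine.

    tokens:      list of token vectors (lists of floats)
    gate_logits: one list of per-expert logits for each token
    experts:     list of callables mapping a vector to a vector
    """
    # Stage 1, dispatch: group (token index, gate weight) pairs by expert.
    buckets = {e: [] for e in range(len(experts))}
    for t, logits in enumerate(gate_logits):
        for e, weight in top_k(softmax(logits), k):
            buckets[e].append((t, weight))

    # Stages 2 and 3, expert compute and combine: each expert processes its
    # routed tokens, and the outputs are summed back per token, scaled by the
    # gate weight.
    out = [[0.0] * len(tok) for tok in tokens]
    for e, routed in buckets.items():
        for t, weight in routed:
            y = experts[e](tokens[t])
            out[t] = [acc + weight * yi for acc, yi in zip(out[t], y)]
    return out
```

For example, with two toy experts (one doubling its input, one adding 1) and equal gate logits, a token is processed by both experts and the results are averaged. The per-expert buckets are the unit of work that a distributed system ships between GPUs, which is where the communication overhead mentioned above comes from.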

Quick Start & Requirements

Install from pip with pip install flashmoe-py[cu12] (or cu13 in place of cu12). C++ projects integrate through CMake using CPMAddPackage. Prerequisites: the CUDA toolkit, C++20, ninja, CMake >= 3.28, and GPUs of SM 70 or newer connected by a P2P interconnect (NVLink, PCIe, or GPUDirect RDMA). Dependencies: cuBLASDx and NVSHMEM.
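
Before building, it can help to confirm that the build tools listed above are on the PATH. The following is a small standard-library sketch, not part of FlashMoE: the helper names parse_version, cmake_ok, and missing_tools are hypothetical, and it assumes the CUDA toolkit is detectable through its nvcc compiler driver. It does not check the GPU architecture, the interconnect, cuBLASDx, or NVSHMEM.

```python
import re
import shutil

# Minimum CMake version taken from the prerequisites above.
MIN_CMAKE = (3, 28)


def parse_version(text):
    """Extract the first dotted version number from text as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups() if part is not None)


def cmake_ok(version_text, minimum=MIN_CMAKE):
    """True if the version found in version_text is at least `minimum`."""
    version = parse_version(version_text)
    return version is not None and version[:2] >= minimum


def missing_tools(tools=("nvcc", "cmake", "ninja")):
    """Return the tools from `tools` that are not found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

A typical use is to call missing_tools() first and then pass the first line printed by cmake --version to cmake_ok.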

Highlighted Details

  • Up to 5x speedup and 69% tensor core utilization increase on frontier MoE models

Health Check

  • Last Commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 1

Star History

14 stars in the last 30 days
