FlashMoE by osayamenja

Fast distributed Mixture-of-Experts inference

Created 2 years ago

271 stars

Top 94.8% on SourcePulse

Project Summary

FlashMoE addresses the significant inference bottlenecks in Distributed Mixture-of-Experts (DMoE) models, characterized by low tensor core utilization and high communication overhead. This project delivers a fully fused, single-kernel system for high-throughput, low-latency DMoE inference. By eliminating kernel boundaries and enabling fine-grained overlap of communication and computation, FlashMoE offers substantial performance gains over existing solutions.

How It Works

FlashMoE's core innovation is complete kernel fusion, integrating MoE dispatch, expert computation, and MoE combine into a single, tile-pipelined persistent kernel. This approach embeds an "Operating System within the kernel" for concurrent task scheduling, effectively hiding system and communication latency. By enabling fine-grained overlap of communication and computation and exploiting task locality, FlashMoE minimizes GPU stalls and maximizes tensor core utilization, overcoming inefficiencies of traditional multi-kernel DMoE implementations.
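To make the three fused stages concrete, here is a minimal, unfused reference sketch of dispatch, expert computation, and combine in plain Python. It is not FlashMoE's API or implementation: the function names, the `routes` layout, and the toy experts are all hypothetical and chosen only for illustration. A conventional multi-kernel system runs these stages as separate steps; FlashMoE, per the description above, pipelines them at tile granularity inside one persistent kernel.

```python
from collections import defaultdict


def dispatch(routes):
    """Group tokens by the expert they are routed to.

    routes[t] is a list of (expert_id, gate_weight) pairs for token t
    (hypothetical layout, assumed for this sketch).
    """
    per_expert = defaultdict(list)
    for token_idx, choices in enumerate(routes):
        for expert_id, weight in choices:
            per_expert[expert_id].append((token_idx, weight))
    return per_expert


def expert_compute(tokens, per_expert, experts):
    """Run each expert on the tokens assigned to it."""
    results = []
    for expert_id, assigned in per_expert.items():
        expert_fn = experts[expert_id]
        for token_idx, weight in assigned:
            results.append((token_idx, weight, expert_fn(tokens[token_idx])))
    return results


def combine(num_tokens, dim, results):
    """Sum the gate-weighted expert outputs back into token order."""
    out = [[0.0] * dim for _ in range(num_tokens)]
    for token_idx, weight, vec in results:
        for i, value in enumerate(vec):
            out[token_idx][i] += weight * value
    return out


def moe_forward(tokens, routes, experts):
    """Unfused reference path: dispatch, then compute, then combine."""
    per_expert = dispatch(routes)
    results = expert_compute(tokens, per_expert, experts)
    return combine(len(tokens), len(tokens[0]), results)


# Toy example: two experts, two tokens of dimension 2.
experts = [
    lambda x: [2.0 * v for v in x],  # expert 0 doubles each element
    lambda x: [v + 1.0 for v in x],  # expert 1 adds one to each element
]
tokens = [[1.0, 2.0], [3.0, 4.0]]
routes = [[(0, 0.5), (1, 0.5)], [(1, 1.0)]]
output = moe_forward(tokens, routes, experts)
```

In a distributed setting, the dispatch and combine steps are where tokens cross GPU boundaries, which is why overlapping them with expert computation matters for latency.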

Quick Start & Requirements

Install the Python package with pip: pip install flashmoe-py[cu12] (or the cu13 extra). C++ projects integrate through CMake using CPMAddPackage. Prerequisites: the CUDA toolkit, a C++20 compiler, ninja, CMake (>= 3.28), and SM 70+ GPUs connected by a P2P interconnect (NVLink, PCIe, or GPUDirect RDMA). Dependencies: cuBLASDx and NVSHMEM. Links for cuBLASDx and NVSHMEM are provided.
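Before installing, it can help to confirm that the build tools listed above are present. The script below is a rough pre-flight sketch, not part of FlashMoE: the helper names are hypothetical, it treats nvcc on PATH as a stand-in for "CUDA toolkit installed", and it does not check the GPU architecture, the C++20 compiler, cuBLASDx, or NVSHMEM.

```python
import re
import shutil
import subprocess

MIN_CMAKE = (3, 28)  # minimum CMake version from the requirements above


def parse_version(text):
    """Return the first dotted version number in text as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups() if part is not None)


def cmake_ok(version_output, minimum=MIN_CMAKE):
    """Check the output of `cmake --version` against the minimum version."""
    version = parse_version(version_output)
    return version is not None and version[:2] >= minimum


def check_toolchain():
    """Map each required tool to True if it was found on PATH."""
    report = {
        tool: shutil.which(tool) is not None
        for tool in ("nvcc", "ninja", "cmake")
    }
    if report["cmake"]:
        result = subprocess.run(
            ["cmake", "--version"], capture_output=True, text=True
        )
        report["cmake>=3.28"] = cmake_ok(result.stdout)
    return report


if __name__ == "__main__":
    for name, ok in check_toolchain().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```

A missing entry only means the tool was not found on PATH; it may still be installed elsewhere on the system.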

Highlighted Details

Up to 5x speedup and 69% tensor core utilization increase on frontier MoE models

Health Check

Last Commit: 2 months ago

Responsiveness: Inactive

Pull Requests (30d): 1

Issues (30d): 0

Star History

5 stars in the last 30 days

Explore Similar Projects

Starred by Vincent Weisser (Cofounder of Prime Intellect), Wing Lian (Founder of Axolotl AI), and 1 more.

varuna by microsoft

Tool for efficient large DNN model training on commodity hardware

Created 5 years ago

Updated 1 year ago

FlashRT by flashrt-project

High-performance realtime inference engine for AI workloads

Created 2 months ago

Updated 1 day ago

FlashQLA by QwenLM

Accelerate AI workloads with high-performance linear attention kernels

Created 2 months ago

Updated 2 days ago

Starred by Jesse Clark (Cofounder of Marqo), Luis Capelo (Cofounder of Lightning AI), and 4 more.

sonic-moe by Dao-AILab

Accelerating Mixture-of-Experts (MoE) models

Created 6 months ago

Updated 1 week ago

Starred by Wing Lian (Founder of Axolotl AI), Zhiqiang Xie (Coauthor of SGLang), and 1 more.

TileRT by tile-ai

Ultra-low-latency LLM inference runtime

Created 8 months ago

Updated 1 month ago

Starred by Ying Sheng (Coauthor of SGLang).

fastertransformer_backend by triton-inference-server

Triton backend for optimized transformer inference

Created 5 years ago

Updated 2 years ago

Starred by Lei Zhang (Director Engineering AI at AMD) and Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI).

Triton-distributed by ByteDance-Seed

Distributed compiler for computation-communication overlapping, based on Triton

Created 1 year ago

Updated 2 weeks ago

bolt by huawei-noah

Deep learning library for high-performance, heterogeneous deployment

Created 6 years ago

Updated 1 year ago

Starred by Matei Zaharia (Cofounder of Databricks), Yaowei Zheng (Author of LLaMA-Factory), and 18 more.

megablocks by databricks

Lightweight library for mixture-of-experts (MoE) training

Created 3 years ago

Updated 3 months ago

aiter by ROCm

High-performance AI operator library for ROCm

Created 1 year ago

Updated 17 hours ago

Starred by Jesse Clark (Cofounder of Marqo), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 10 more.

DeepGEMM by deepseek-ai

CUDA library for efficient FP8 GEMM kernels with fine-grained scaling

Created 1 year ago

Updated 5 days ago

Starred by Jeff Hammerbacher (Cofounder of Cloudera), Chaoyu Yang (Founder of Bento), and 7 more.

DeepEP by deepseek-ai

Expert-parallel communication library for MoE, targeting high-throughput and low-latency

Created 1 year ago

Updated 1 day ago
