osayamenja: Fast distributed Mixture-of-Experts inference
Top 97.4% on SourcePulse
FlashMoE targets two inference bottlenecks in Distributed Mixture-of-Experts (DMoE) models: low tensor core utilization and high communication overhead. The project delivers a fully fused, single-kernel system for high-throughput, low-latency DMoE inference. By eliminating kernel boundaries and overlapping communication with computation at fine granularity, it reports substantial performance gains over existing solutions.
How It Works
FlashMoE's core innovation is complete kernel fusion: MoE dispatch, expert computation, and MoE combine all run inside a single, tile-pipelined persistent kernel. The kernel embeds what the project calls an "Operating System within the kernel", which schedules tasks concurrently and hides system and communication latency. Fine-grained overlap of communication and computation, combined with task locality, minimizes GPU stalls and maximizes tensor core utilization, avoiding the inefficiencies of traditional multi-kernel DMoE implementations.
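To get a feel for why tile-level overlap matters, here is a toy cost model in Python. It is not FlashMoE code and does not reflect how the kernel is implemented: the function names, the three-stage split, and the per-tile times are all assumptions made up for illustration. It compares a baseline with no overlap between dispatch, expert compute, and combine against a pipeline in which each tile moves to the next stage as soon as that stage is free.

```python
def unoverlapped_time(n_tiles, t_dispatch, t_compute, t_combine):
    """Baseline with no overlap: every tile pays for all three stages in full."""
    return n_tiles * (t_dispatch + t_compute + t_combine)


def pipelined_time(n_tiles, t_dispatch, t_compute, t_combine):
    """Toy three-stage pipeline (dispatch -> compute -> combine).

    Each stage handles one tile at a time, and a tile enters a stage as soon
    as both the tile and the stage are ready. All times are hypothetical
    units, not measurements of any real system.
    """
    dispatch_done = compute_done = combine_done = 0.0
    for _ in range(n_tiles):
        dispatch_done = dispatch_done + t_dispatch
        compute_done = max(compute_done, dispatch_done) + t_compute
        combine_done = max(combine_done, compute_done) + t_combine
    return combine_done


if __name__ == "__main__":
    # Hypothetical per-tile costs: 1 unit dispatch, 2 units compute, 1 unit combine.
    print(unoverlapped_time(8, 1, 2, 1))
    print(pipelined_time(8, 1, 2, 1))
```

In this model the pipelined total is dominated by the slowest stage, so communication cost is largely hidden whenever compute is the bottleneck. Real GPU behavior depends on many factors the model ignores, such as contention for memory bandwidth and scheduling overhead.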
Quick Start & Requirements
Install via pip: pip install flashmoe-py[cu12] (or cu13 in place of cu12). C++ integration uses CMake (CPMAddPackage). Prerequisites: the CUDA toolkit, a C++20 compiler, ninja, CMake (>= 3.28), and SM 70+ GPUs with a P2P interconnect (NVLink, PCIe, GPUDirect RDMA). Dependencies: cuBLASDx and NVSHMEM.
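Before building, it can help to confirm that the build tools listed above are on the PATH. The following is a minimal sketch, not part of FlashMoE: the helper names are hypothetical, it assumes the CUDA toolkit is detectable through its nvcc compiler, and it only checks the CMake version (>= 3.28). It does not verify the GPU architecture, the interconnect, cuBLASDx, or NVSHMEM.

```python
import re
import shutil
import subprocess

MIN_CMAKE = (3, 28)  # minimum CMake version stated in the requirements


def parse_version(text):
    """Return the first dotted version found in `text` as a tuple of ints, or None."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups() if part is not None)


def cmake_is_new_enough(version_output, minimum=MIN_CMAKE):
    """Check the major.minor version in `cmake --version` output against `minimum`."""
    version = parse_version(version_output)
    return version is not None and version[:2] >= minimum


def check_build_tools():
    """Report which build tools are on PATH and whether CMake meets the minimum."""
    report = {tool: shutil.which(tool) is not None for tool in ("cmake", "ninja", "nvcc")}
    if report["cmake"]:
        result = subprocess.run(["cmake", "--version"], capture_output=True, text=True)
        report["cmake>=3.28"] = cmake_is_new_enough(result.stdout)
    return report


if __name__ == "__main__":
    for name, ok in check_build_tools().items():
        print(f"{name}: {'ok' if ok else 'missing or too old'}")
```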