Expert-parallel communication library for MoE models, targeting high throughput and low latency
DeepEP is a communication library designed for efficient expert parallelism (EP) in Mixture-of-Experts (MoE) models. It provides high-throughput, low-latency GPU kernels for MoE dispatch and combine operations, supports low-precision formats such as FP8, and includes kernels optimized for asymmetric-domain bandwidth forwarding (e.g., NVLink to RDMA). The library targets researchers and engineers working with large-scale MoE models for both training and inference.
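To make the dispatch/combine terminology concrete, the sketch below shows a single-device PyTorch version of the two steps: dispatch groups each token's copies by destination expert, and combine scatters the expert outputs back to token order and applies the router weights. It is purely illustrative and does not use DeepEP's kernels or API; in DeepEP these steps are all-to-all exchanges across GPUs.

```python
import torch

# Illustrative shapes: 8 tokens, hidden size 16, 4 experts, top-2 routing.
num_tokens, hidden, num_experts, top_k = 8, 16, 4, 2
x = torch.randn(num_tokens, hidden)
router_logits = torch.randn(num_tokens, num_experts)

# Router picks top-k experts and mixing weights per token.
weights, expert_ids = torch.topk(router_logits.softmax(dim=-1), top_k, dim=-1)

# "Dispatch": group token copies by destination expert (in DeepEP this is an
# all-to-all across GPUs; here everything stays on one device).
flat_experts = expert_ids.flatten()                          # (num_tokens * top_k,)
flat_tokens = torch.arange(num_tokens).repeat_interleave(top_k)
order = torch.argsort(flat_experts)                          # sort copies by expert id
dispatched = x[flat_tokens[order]]                           # tokens laid out per expert

# Each expert processes its slice (identity "experts" for brevity).
expert_out = dispatched

# "Combine": scatter expert outputs back to token order, weighted by the router.
combined = torch.zeros_like(x)
combined.index_add_(0, flat_tokens[order],
                    expert_out * weights.flatten()[order].unsqueeze(-1))
```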
How It Works
DeepEP implements specialized all-to-all GPU kernels for MoE communication. Its "normal" kernels are optimized for high throughput, leveraging NVLink for intranode and RDMA for internode communication, and allow controlling how many SMs they use. For latency-sensitive inference decoding, "low-latency" kernels use pure RDMA to minimize delays. A key feature is a hook-based communication-computation overlapping method, which allows RDMA network traffic to proceed in the background without occupying GPU SM resources.
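The hook mechanism amounts to a split-phase receive: the dispatch call returns immediately together with a callable that finalizes the receive later, so independent computation can run while the network traffic progresses. Below is a minimal sketch of that control flow, with a background thread standing in for RDMA; all names (issue_dispatch, recv_hook) are illustrative placeholders, not DeepEP's actual API.

```python
import threading
import queue

import torch


def issue_dispatch(tokens: torch.Tensor):
    """Start a (simulated) background transfer and return a hook that finalizes it.

    DeepEP performs the real transfer over RDMA without occupying GPU SMs;
    here a Python thread merely stands in for the network so the control
    flow can be shown end to end.
    """
    result = queue.Queue(maxsize=1)

    def transfer():
        # Placeholder for the token all-to-all: just hand back a copy.
        result.put(tokens.clone())

    threading.Thread(target=transfer).start()

    def recv_hook() -> torch.Tensor:
        # Blocks only if the background transfer has not finished yet.
        return result.get()

    return recv_hook


tokens = torch.randn(8, 16)
hook = issue_dispatch(tokens)     # returns immediately; transfer runs in the background
overlapped = tokens @ tokens.T    # stand-in for attention or other independent compute
received = hook()                 # finalize the receive once the data is actually needed
```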
Quick Start & Requirements
DeepEP depends on NVSHMEM; with NVSHMEM installed, build and install the library with:

NVSHMEM_DIR=/path/to/installed/nvshmem python setup.py install

Cluster settings may need to be adjusted in tests/utils.py before running the bundled tests.

Highlighted Details
Maintenance & Community
The project has seen recent performance enhancements (April 2025) through contributions from Tencent Network Platform Department. Community forks exist, such as Infrawaves/DeepEP_ibrc_dual-ports_multiQP.
Licensing & Compatibility
Released under the MIT License, with code that references NVSHMEM subject to the NVSHMEM SLA. Compatible with commercial use, provided the NVSHMEM SLA terms are met.
Limitations & Caveats
The implementation may differ slightly from the DeepSeek-V3 paper. A roadmap item indicates that A100 support (intranode only) is planned but not yet available. For performance, the library uses an undefined-behavior PTX instruction, which can be disabled by setting DISABLE_AGGRESSIVE_PTX_INSTRS=1 if issues arise on certain platforms. The default configurations are tuned for DeepSeek's internal cluster, so auto-tuning is recommended for other environments.