DeepEP by deepseek-ai

Expert-parallel communication library for MoE, targeting high-throughput and low-latency

created 5 months ago
8,327 stars

Top 6.3% on sourcepulse

View on GitHub
Project Summary

DeepEP is a communication library designed for efficient expert parallelism (EP) in Mixture-of-Experts (MoE) models. It provides high-throughput, low-latency GPU kernels for MoE dispatch and combine operations, supports low-precision formats such as FP8, and includes kernels optimized for asymmetric-domain bandwidth forwarding, such as from the NVLink domain to the RDMA domain. The library targets researchers and engineers working with large-scale MoE models in both training and inference.
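For context on those two operations: "dispatch" routes each token to the experts its router selected (across the GPUs of an EP group), and "combine" gathers the expert outputs back and reduces them with the routing weights. The single-device sketch below illustrates only these semantics; it is a conceptual reference, not DeepEP code, and all names in it are illustrative:

    import torch

    def moe_dispatch_combine(x, topk_idx, topk_weights, experts):
        # x: [num_tokens, hidden]; topk_idx, topk_weights: [num_tokens, k].
        # Single-device reference for the routing semantics only; DeepEP
        # performs the same dispatch/combine across GPUs over NVLink/RDMA.
        out = torch.zeros_like(x)
        for e, expert in enumerate(experts):
            token_ids, slot = (topk_idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            y = expert(x[token_ids])             # "dispatch" tokens to expert e
            w = topk_weights[token_ids, slot].unsqueeze(-1)
            out.index_add_(0, token_ids, y * w)  # weighted "combine"
        return out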

How It Works

DeepEP implements specialized all-to-all GPU kernels for MoE communication. Its "normal" kernels are optimized for high throughput, using NVLink for intranode and RDMA for internode communication, and allow control over the number of SMs they occupy. For latency-sensitive inference decoding, "low-latency" kernels use pure RDMA to minimize delays. A key feature is a hook-based communication-computation overlap: a call can return immediately with a hook, letting RDMA network traffic proceed in the background without occupying any GPU SM resources until the hook is invoked.
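A sketch of how the low-latency path and its receive hook can be driven, modeled on the usage shown in the DeepEP README; exact signatures may differ between versions, and run_other_work and run_experts are hypothetical placeholders for overlapped computation and the grouped expert FFN:

    import torch
    from deep_ep import Buffer

    def run_other_work():        # placeholder: overlapped compute (e.g. attention)
        pass

    def run_experts(x, counts):  # placeholder: grouped expert FFN
        return x

    def decode_step(buffer: Buffer, x: torch.Tensor, topk_idx, topk_weights,
                    max_tokens_per_rank: int, num_experts: int):
        # With return_recv_hook=True, the call returns immediately and the
        # RDMA traffic proceeds in the background without occupying SMs.
        recv_x, recv_count, handle, event, hook = buffer.low_latency_dispatch(
            x, topk_idx, max_tokens_per_rank, num_experts, return_recv_hook=True)
        run_other_work()         # overlap compute with the network transfer
        hook()                   # block until dispatched tokens have arrived
        expert_out = run_experts(recv_x, recv_count)
        combined, event, hook = buffer.low_latency_combine(
            expert_out, topk_idx, topk_weights, handle, return_recv_hook=True)
        run_other_work()
        hook()
        return combined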

Quick Start & Requirements

  • Installation: NVSHMEM_DIR=/path/to/installed/nvshmem python setup.py install
  • Prerequisites: Hopper GPUs (may support others), Python 3.8+, CUDA 12.3+, PyTorch 2.1+, NVLink, RDMA network, and a modified NVSHMEM dependency.
  • Setup: Requires downloading and installing NVSHMEM. Testing involves modifying tests/utils.py for cluster settings. A minimal first-use sketch follows this list.
  • Documentation: DeepEP README
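A rough first-use sketch after installation; the buffer byte sizes below are arbitrary placeholders (the README derives them from per-config size hints), so treat this as illustrative rather than the documented setup path:

    import torch.distributed as dist
    from deep_ep import Buffer

    # Assumes launch via torchrun with one process per GPU.
    dist.init_process_group(backend="nccl")
    group = dist.new_group(range(dist.get_world_size()))

    # One communication buffer per rank, backing NVLink (intranode) and
    # RDMA (internode) transfers; the sizes are placeholder guesses.
    buffer = Buffer(group, num_nvl_bytes=int(256e6), num_rdma_bytes=int(256e6))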

Highlighted Details

  • Achieves high throughput with NVLink (intranode) and RDMA (internode) forwarding, with reported bandwidths up to 158 GB/s intranode and 58 GB/s internode; the SM budget of the kernels is tunable (see the sketch after this list).
  • Low-latency kernels achieve dispatch latencies as low as 163 us with 46 GB/s RDMA bandwidth.
  • Supports FP8 dispatching and BF16 combining.
  • Offers a hook-based mechanism for communication-computation overlap without SM occupation.
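Because the throughput numbers above depend on how many SMs the normal kernels are allowed to occupy, the SM budget can be capped so the rest stay free for computation. A sketch assuming the SM-control entry point referenced in the README (the exact name and placement may vary by version):

    from deep_ep import Buffer

    # Cap the high-throughput kernels at 24 SMs (an example value to tune);
    # the remaining SMs stay available for overlapped computation.
    Buffer.set_num_sms(24)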

Maintenance & Community

The project received performance enhancements in April 2025 through contributions from the Tencent Network Platform Department. Community forks exist, such as Infrawaves/DeepEP_ibrc_dual-ports_multiQP.

Licensing & Compatibility

Released under the MIT License, except for code referencing NVSHMEM, which is subject to the NVSHMEM SLA. Compatible with commercial use, provided the NVSHMEM SLA terms are met.

Limitations & Caveats

The implementation may differ slightly from the DeepSeek-V3 paper. A roadmap item indicates that A100 support (intranode only) is planned but not yet available. For performance, the library relies on a PTX usage whose behavior is formally undefined; it can be disabled by setting DISABLE_AGGRESSIVE_PTX_INSTRS=1 if issues arise on certain platforms. The default configurations are tuned for DeepSeek's internal cluster; auto-tuning is recommended for other environments.

Health Check

  • Last commit: 2 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 34
  • Issues (30d): 41
  • Star History: 872 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Jaret Burkett (founder of Ostris), and 1 more.

nunchaku by nunchaku-tech

High-performance 4-bit diffusion model inference engine

created 9 months ago, updated 1 day ago
3k stars

Top 2.1% on sourcepulse