DeepEP by deepseek-ai

Expert-parallel communication library for MoE, targeting high-throughput and low-latency

created 5 months ago
8,327 stars

Top 6.3% on sourcepulse

View on GitHub
Project Summary

DeepEP is a communication library designed for efficient expert parallelism (EP) in Mixture-of-Experts (MoE) models. It provides high-throughput, low-latency GPU kernels for MoE dispatch and combine operations, supports low-precision formats such as FP8, and includes kernels optimized for asymmetric-domain bandwidth forwarding, such as from the NVLink domain to the RDMA domain. The library targets researchers and engineers working with large-scale MoE models in both training and inference.
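For context on those two operations: "dispatch" routes each token to the experts its router selected (across the GPUs of an EP group), and "combine" gathers the expert outputs back and reduces them with the routing weights. The single-device sketch below illustrates only these semantics; it is a conceptual reference, not DeepEP code, and all names in it are illustrative:

    import torch

    def moe_dispatch_combine(x, topk_idx, topk_weights, experts):
        # x: [num_tokens, hidden]; topk_idx, topk_weights: [num_tokens, k].
        # Single-device reference for the routing semantics only; DeepEP
        # performs the same dispatch/combine across GPUs over NVLink/RDMA.
        out = torch.zeros_like(x)
        for e, expert in enumerate(experts):
            token_ids, slot = (topk_idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            y = expert(x[token_ids])             # "dispatch" tokens to expert e
            w = topk_weights[token_ids, slot].unsqueeze(-1)
            out.index_add_(0, token_ids, y * w)  # weighted "combine"
        return out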

How It Works

DeepEP implements specialized all-to-all GPU kernels for MoE communication. Its "normal" kernels are optimized for high throughput, using NVLink for intranode and RDMA for internode communication, and allow control over the number of SMs they occupy. For latency-sensitive inference decoding, "low-latency" kernels use pure RDMA to minimize delays. A key feature is a hook-based communication-computation overlap: a call can return immediately with a hook, letting RDMA network traffic proceed in the background without occupying any GPU SM resources until the hook is invoked.
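A sketch of how the low-latency path and its receive hook can be driven, modeled on the usage shown in the DeepEP README; exact signatures may differ between versions, and run_other_work and run_experts are hypothetical placeholders for overlapped computation and the grouped expert FFN:

    import torch
    from deep_ep import Buffer

    def run_other_work():        # placeholder: overlapped compute (e.g. attention)
        pass

    def run_experts(x, counts):  # placeholder: grouped expert FFN
        return x

    def decode_step(buffer: Buffer, x: torch.Tensor, topk_idx, topk_weights,
                    max_tokens_per_rank: int, num_experts: int):
        # With return_recv_hook=True, the call returns immediately and the
        # RDMA traffic proceeds in the background without occupying SMs.
        recv_x, recv_count, handle, event, hook = buffer.low_latency_dispatch(
            x, topk_idx, max_tokens_per_rank, num_experts, return_recv_hook=True)
        run_other_work()         # overlap compute with the network transfer
        hook()                   # block until dispatched tokens have arrived
        expert_out = run_experts(recv_x, recv_count)
        combined, event, hook = buffer.low_latency_combine(
            expert_out, topk_idx, topk_weights, handle, return_recv_hook=True)
        run_other_work()
        hook()
        return combined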

Quick Start & Requirements

  • Installation: NVSHMEM_DIR=/path/to/installed/nvshmem python setup.py install
  • Prerequisites: Hopper GPUs (may support others), Python 3.8+, CUDA 12.3+, PyTorch 2.1+, NVLink, RDMA network, and a modified NVSHMEM dependency.
  • Setup: Requires downloading and installing NVSHMEM. Testing involves modifying tests/utils.py for cluster settings. A minimal first-use sketch follows this list.
  • Documentation: DeepEP README
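A rough first-use sketch after installation; the buffer byte sizes below are arbitrary placeholders (the README derives them from per-config size hints), so treat this as illustrative rather than the documented setup path:

    import torch.distributed as dist
    from deep_ep import Buffer

    # Assumes launch via torchrun with one process per GPU.
    dist.init_process_group(backend="nccl")
    group = dist.new_group(range(dist.get_world_size()))

    # One communication buffer per rank, backing NVLink (intranode) and
    # RDMA (internode) transfers; the sizes are placeholder guesses.
    buffer = Buffer(group, num_nvl_bytes=int(256e6), num_rdma_bytes=int(256e6))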

Highlighted Details

  • Achieves high throughput with NVLink (intranode) and RDMA (internode) forwarding, with reported bandwidths up to 158 GB/s intranode and 58 GB/s internode; the SM budget of the kernels is tunable (see the sketch after this list).
  • Low-latency kernels achieve dispatch latencies as low as 163 us with 46 GB/s RDMA bandwidth.
  • Supports FP8 dispatching and BF16 combining.
  • Offers a hook-based mechanism for communication-computation overlap without SM occupation.
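Because the throughput numbers above depend on how many SMs the normal kernels are allowed to occupy, the SM budget can be capped so the rest stay free for computation. A sketch assuming the SM-control entry point referenced in the README (the exact name and placement may vary by version):

    from deep_ep import Buffer

    # Cap the high-throughput kernels at 24 SMs (an example value to tune);
    # the remaining SMs stay available for overlapped computation.
    Buffer.set_num_sms(24)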

Maintenance & Community

The project received performance enhancements in April 2025 through contributions from the Tencent Network Platform Department. Community forks exist, such as Infrawaves/DeepEP_ibrc_dual-ports_multiQP.

Licensing & Compatibility

Released under the MIT License, except for code referencing NVSHMEM, which is subject to the NVSHMEM SLA. Compatible with commercial use, provided the NVSHMEM SLA terms are met.

Limitations & Caveats

The implementation may differ slightly from the DeepSeek-V3 paper. A roadmap item indicates that A100 support (intranode only) is planned but not yet available. For performance, the library relies on a PTX usage whose behavior is formally undefined; it can be disabled by setting DISABLE_AGGRESSIVE_PTX_INSTRS=1 if issues arise on certain platforms. The default configurations are tuned for DeepSeek's internal cluster; auto-tuning is recommended for other environments.

Health Check

  • Last commit: 2 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 34
  • Issues (30d): 41
  • Star History: 872 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Jaret Burkett (founder of Ostris), and 1 more.

nunchaku by nunchaku-tech

High-performance 4-bit diffusion model inference engine

created 9 months ago, updated 1 day ago
3k stars

Top 2.1% on sourcepulse