LPLB by deepseek-ai

MoE load balancer optimizing expert workload distribution via linear programming

Created 1 week ago

418 stars

Top 70.1% on SourcePulse

View on GitHub
Project Summary

deepseek-ai/LPLB is an early-stage research project introducing a parallel load balancer for MoE models. It leverages linear programming to optimize expert workload distribution, targeting researchers and engineers working with MoE architectures. The primary benefit is mitigating dynamic load imbalances during training by intelligently reordering experts and assigning tokens.

How It Works

LPLB extends the Expert Parallelism Load Balancer (EPLB) by employing linear programming (LP) to dynamically rebalance token assignments on a per-batch basis. It formulates the load-balancing problem as an LP that minimizes imbalance within an expert-parallel group while respecting edge capacities defined by token counts. Real-time workload statistics are synchronized efficiently using NVLINK and NVSHMEM, significantly reducing communication overhead compared to standard distributed primitives like torch.distributed.allreduce.
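
To make that formulation concrete, below is a minimal CPU-side sketch of a min-max rebalancing LP of the kind described above, written with scipy and entirely made-up per-rank loads, replication edges, and capacities. It is not LPLB's interface, and LPLB solves its LP with an embedded single-SM Interior Point solver on the GPU rather than scipy; the sketch only shows the shape of the problem.

```python
# Illustrative sketch of a per-batch rebalancing LP: minimize the maximum
# per-rank load by shifting tokens along replication "edges", each capped by
# the tokens available to move. NOT LPLB's API or its on-GPU solver -- a small
# CPU reconstruction with scipy under assumed variable definitions.
import numpy as np
from scipy.optimize import linprog

loads = np.array([900.0, 300.0, 500.0, 700.0])   # tokens per rank (made up)
# Directed edges (src_rank, dst_rank) along which tokens can be re-routed to a
# replica expert, with per-edge capacities in tokens (made up numbers).
edges = [(0, 1), (0, 2), (3, 1), (3, 2)]
caps = np.array([400.0, 400.0, 200.0, 200.0])

n_ranks, n_edges = len(loads), len(edges)
# Variables: x = [f_0 .. f_{E-1}, t]; f_e = tokens shifted along edge e,
# t = upper bound on every rank's resulting load (the value we minimize).
c = np.zeros(n_edges + 1)
c[-1] = 1.0

# Constraint per rank r: loads[r] - outflow_r + inflow_r <= t, rewritten as
# -outflow_r + inflow_r - t <= -loads[r].
A_ub = np.zeros((n_ranks, n_edges + 1))
for e, (src, dst) in enumerate(edges):
    A_ub[src, e] = -1.0   # flow leaving src lowers its load
    A_ub[dst, e] = +1.0   # flow entering dst raises its load
A_ub[:, -1] = -1.0
b_ub = -loads

bounds = [(0.0, cap) for cap in caps] + [(0.0, None)]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
print("max per-rank load after rebalance:", res.x[-1])
print("tokens moved per edge:", res.x[:-1])
```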

Quick Start & Requirements

  • Prerequisites: CUDA Toolkit >= 12.6.3 (with cuSolverDx dependencies), DeepEP (optional but strongly recommended for practical use), EPLB (embedded).
  • Installation: Execute ./download-mathdx.sh, set NVSHMEM_DIR=..., then run pip install --no-build-isolation . or pip install --no-build-isolation --editable . for testing.
  • Documentation: Interface and example usage are provided within the README.

Highlighted Details

  • Utilizes an embedded LP solver implementing a single-SM Interior Point Method, leveraging NVIDIA's cuSolverDx and cuBLASDx for accelerated linear algebra operations.
  • Features optimized real-time workload synchronization via NVLINK/NVSHMEM, bypassing torch.distributed.allreduce to minimize communication latency.
  • Supports configurable expert replication topologies including Cube, Hypercube, and Torus, which can be customized by modifying the r2o matrix.
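
As a hypothetical illustration of that last point: assuming r2o is a replica-to-original mapping that records, for each rank, which original experts it hosts redundant copies of (the name comes from the README, but the shape and semantics shown here are an assumption, not LPLB's actual data layout), a custom replication topology could be sketched roughly as follows.

```python
import numpy as np

# Hypothetical r2o ("replica-to-original") mapping: row i lists which original
# experts rank i hosts redundant copies of. Purely illustrative numbers.
experts_per_rank = 2                      # originals 0..7 spread over 4 ranks
r2o = np.array([
    [2, 4],   # rank 0 also hosts replicas of experts 2 and 4
    [0, 6],   # rank 1 also hosts replicas of experts 0 and 6
    [6, 0],   # rank 2 ...
    [4, 2],   # rank 3 ...
])

def replication_edges(r2o, experts_per_rank):
    """Derive (owner_rank -> replica_rank) edges implied by an r2o mapping.
    Edges like these are what give the balancer its routes for moving tokens."""
    edges = []
    for replica_rank, originals in enumerate(r2o):
        for orig in originals:
            owner = int(orig) // experts_per_rank
            if owner != replica_rank:
                edges.append((owner, replica_rank))
    return edges

print(replication_edges(r2o, experts_per_rank))
# -> [(1, 0), (2, 0), (0, 1), (3, 1), (3, 2), (0, 2), (2, 3), (1, 3)]
```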

Maintenance & Community

No specific details regarding notable contributors, sponsorships, partnerships, or community channels (e.g., Discord, Slack) are provided in the README.

Licensing & Compatibility

The README does not specify a license type or provide compatibility notes relevant for commercial use or closed-source linking.

Limitations & Caveats

The current planner optimizes for total token count rather than the non-linear computational cost of grouped matrix multiplications, which may lead to suboptimal performance. Solver latency (~100 µs intra-node) can be non-negligible for small batches. Under extreme global load imbalance, LPLB may perform worse than EPLB due to differences in how redundant experts are assigned.
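
A toy example of the first caveat, using an assumed tile-quantized cost model that is not LPLB's actual kernel behavior: two ranks holding the same total number of tokens can still launch very different amounts of grouped-GEMM work.

```python
# Toy illustration (assumed cost model) of why equal token totals don't imply
# equal grouped-GEMM cost: each expert's kernel rounds its batch up to a tile,
# so many small expert batches cost more than one large one.
import math

TILE = 128  # assumed GEMM tile size along the token dimension

def grouped_gemm_cost(tokens_per_expert, tile=TILE):
    """Cost in tile-rows launched: each expert rounds its tokens up to a tile."""
    return sum(math.ceil(t / tile) for t in tokens_per_expert)

rank_a = [512]                              # one expert, 512 tokens
rank_b = [64, 64, 64, 64, 64, 64, 64, 64]   # eight experts, 512 tokens total

print(sum(rank_a), grouped_gemm_cost(rank_a))  # 512 tokens -> 4 tiles
print(sum(rank_b), grouped_gemm_cost(rank_b))  # 512 tokens -> 8 tiles
```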

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 4
  • Star History: 422 stars in the last 12 days
