LPLB by deepseek-ai

MoE load balancer optimizing expert workload distribution via linear programming

Created 1 week ago

418 stars

Top 70.1% on SourcePulse

View on GitHub
Project Summary

deepseek-ai/LPLB is an early-stage research project introducing a parallel load balancer for MoE models. It leverages linear programming to optimize expert workload distribution, targeting researchers and engineers working with MoE architectures. The primary benefit is mitigating dynamic load imbalances during training by intelligently reordering experts and assigning tokens.

How It Works

LPLB extends the Expert Parallelism Load Balancer (EPLB) by employing linear programming (LP) to dynamically rebalance token assignments on a per-batch basis. It formulates the load-balancing problem as an LP that minimizes imbalance within an expert-parallel group while respecting edge capacities defined by token counts. Real-time workload statistics are synchronized efficiently using NVLINK and NVSHMEM, significantly reducing communication overhead compared to standard distributed primitives like torch.distributed.allreduce.
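
To make that formulation concrete, below is a minimal CPU-side sketch of a min-max rebalancing LP of the kind described above, written with scipy and entirely made-up per-rank loads, replication edges, and capacities. It is not LPLB's interface, and LPLB solves its LP with an embedded single-SM Interior Point solver on the GPU rather than scipy; the sketch only shows the shape of the problem.

```python
# Illustrative sketch of a per-batch rebalancing LP: minimize the maximum
# per-rank load by shifting tokens along replication "edges", each capped by
# the tokens available to move. NOT LPLB's API or its on-GPU solver -- a small
# CPU reconstruction with scipy under assumed variable definitions.
import numpy as np
from scipy.optimize import linprog

loads = np.array([900.0, 300.0, 500.0, 700.0])   # tokens per rank (made up)
# Directed edges (src_rank, dst_rank) along which tokens can be re-routed to a
# replica expert, with per-edge capacities in tokens (made up numbers).
edges = [(0, 1), (0, 2), (3, 1), (3, 2)]
caps = np.array([400.0, 400.0, 200.0, 200.0])

n_ranks, n_edges = len(loads), len(edges)
# Variables: x = [f_0 .. f_{E-1}, t]; f_e = tokens shifted along edge e,
# t = upper bound on every rank's resulting load (the value we minimize).
c = np.zeros(n_edges + 1)
c[-1] = 1.0

# Constraint per rank r: loads[r] - outflow_r + inflow_r <= t, rewritten as
# -outflow_r + inflow_r - t <= -loads[r].
A_ub = np.zeros((n_ranks, n_edges + 1))
for e, (src, dst) in enumerate(edges):
    A_ub[src, e] = -1.0   # flow leaving src lowers its load
    A_ub[dst, e] = +1.0   # flow entering dst raises its load
A_ub[:, -1] = -1.0
b_ub = -loads

bounds = [(0.0, cap) for cap in caps] + [(0.0, None)]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
print("max per-rank load after rebalance:", res.x[-1])
print("tokens moved per edge:", res.x[:-1])
```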

Quick Start & Requirements

  • Prerequisites: CUDA Toolkit >= 12.6.3 (with cuSolverDx dependencies), DeepEP (optional but strongly recommended for practical use), EPLB (embedded).
  • Installation: Execute ./download-mathdx.sh, set NVSHMEM_DIR=..., then run pip install --no-build-isolation . or pip install --no-build-isolation --editable . for testing.
  • Documentation: Interface and example usage are provided within the README.

Highlighted Details

  • Utilizes an embedded LP solver implementing a single-SM Interior Point Method, leveraging NVIDIA's cuSolverDx and cuBLASDx for accelerated linear algebra operations.
  • Features optimized real-time workload synchronization via NVLINK/NVSHMEM, bypassing torch.distributed.allreduce to minimize communication latency.
  • Supports configurable expert replication topologies including Cube, Hypercube, and Torus, which can be customized by modifying the r2o matrix.
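
As a hypothetical illustration of that last point: assuming r2o is a replica-to-original mapping that records, for each rank, which original experts it hosts redundant copies of (the name comes from the README, but the shape and semantics shown here are an assumption, not LPLB's actual data layout), a custom replication topology could be sketched roughly as follows.

```python
import numpy as np

# Hypothetical r2o ("replica-to-original") mapping: row i lists which original
# experts rank i hosts redundant copies of. Purely illustrative numbers.
experts_per_rank = 2                      # originals 0..7 spread over 4 ranks
r2o = np.array([
    [2, 4],   # rank 0 also hosts replicas of experts 2 and 4
    [0, 6],   # rank 1 also hosts replicas of experts 0 and 6
    [6, 0],   # rank 2 ...
    [4, 2],   # rank 3 ...
])

def replication_edges(r2o, experts_per_rank):
    """Derive (owner_rank -> replica_rank) edges implied by an r2o mapping.
    Edges like these are what give the balancer its routes for moving tokens."""
    edges = []
    for replica_rank, originals in enumerate(r2o):
        for orig in originals:
            owner = int(orig) // experts_per_rank
            if owner != replica_rank:
                edges.append((owner, replica_rank))
    return edges

print(replication_edges(r2o, experts_per_rank))
# -> [(1, 0), (2, 0), (0, 1), (3, 1), (3, 2), (0, 2), (2, 3), (1, 3)]
```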

Maintenance & Community

No specific details regarding notable contributors, sponsorships, partnerships, or community channels (e.g., Discord, Slack) are provided in the README.

Licensing & Compatibility

The README does not specify a license type or provide compatibility notes relevant for commercial use or closed-source linking.

Limitations & Caveats

The current planner optimizes for total token count rather than the non-linear computational cost of grouped matrix multiplications, which may lead to suboptimal performance. Solver latency (~100 µs intra-node) can be non-negligible for small batches. Under extreme global load imbalance, LPLB may perform worse than EPLB due to differences in how redundant experts are assigned.
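
A toy example of the first caveat, using an assumed tile-quantized cost model that is not LPLB's actual kernel behavior: two ranks holding the same total number of tokens can still launch very different amounts of grouped-GEMM work.

```python
# Toy illustration (assumed cost model) of why equal token totals don't imply
# equal grouped-GEMM cost: each expert's kernel rounds its batch up to a tile,
# so many small expert batches cost more than one large one.
import math

TILE = 128  # assumed GEMM tile size along the token dimension

def grouped_gemm_cost(tokens_per_expert, tile=TILE):
    """Cost in tile-rows launched: each expert rounds its tokens up to a tile."""
    return sum(math.ceil(t / tile) for t in tokens_per_expert)

rank_a = [512]                              # one expert, 512 tokens
rank_b = [64, 64, 64, 64, 64, 64, 64, 64]   # eight experts, 512 tokens total

print(sum(rank_a), grouped_gemm_cost(rank_a))  # 512 tokens -> 4 tiles
print(sum(rank_b), grouped_gemm_cost(rank_b))  # 512 tokens -> 8 tiles
```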

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 4
  • Star History: 422 stars in the last 12 days
