awesome-distributed-ml by Shenggan

Comprehensive guide to distributed AI training and inference

Created 3 years ago
253 stars

Top 99.3% on SourcePulse

Project Summary

This repository serves as a curated catalog of open-source projects and research papers on distributed machine learning, particularly the training and inference of large models. It targets engineers, researchers, and practitioners who need to scale deep learning workloads beyond a single machine, offering a structured overview of state-of-the-art systems and techniques. Its primary benefit is a centralized, categorized resource for navigating the complex landscape of distributed ML.

How It Works

The repository categorizes distributed ML approaches into key areas, detailing relevant projects and papers for each. It covers fundamental parallelism strategies such as pipeline, sequence, and mixture-of-experts (MoE) parallelism, alongside hybrid frameworks that combine multiple techniques. Further sections address critical system-level concerns including memory-efficient training, auto-parallelization, communication optimization, and fault-tolerant training mechanisms. This structured approach allows users to explore diverse architectural choices and algorithmic innovations for optimizing distributed training and inference.
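
The parallelism ideas above are easiest to see in toy form. The sketch below simulates Megatron-style tensor (intra-operator) parallelism in plain NumPy on a single process: the weight matrix of a linear layer is split column-wise across hypothetical ranks, each rank computes a partial output, and a gather reassembles the full activation. All names, shapes, and the two-way split are illustrative assumptions, not APIs from any project listed here.

```python
# Toy single-process simulation of tensor (intra-operator) parallelism:
# split a linear layer's weight column-wise across "ranks", compute
# partial outputs, then gather. Real systems do this across devices
# with collective communication; here NumPy stands in for the runtime.
import numpy as np

rng = np.random.default_rng(0)
batch, d_in, d_out, world_size = 4, 8, 16, 2  # illustrative sizes

x = rng.standard_normal((batch, d_in))   # activations, replicated on every rank
w = rng.standard_normal((d_in, d_out))   # full weight of one linear layer

# Column-parallel split: each rank owns a slice of the output features.
w_shards = np.split(w, world_size, axis=1)

# Each rank computes its partial output; an all-gather (concatenate here)
# reassembles the full activation.
partials = [x @ w_k for w_k in w_shards]
y_parallel = np.concatenate(partials, axis=1)

assert np.allclose(y_parallel, x @ w)  # matches the unsharded computation
```

Pipeline parallelism instead splits the model by layer across ranks, and MoE parallelism routes tokens to expert shards; the hybrid frameworks listed in this section combine these dimensions into a single schedule.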

Quick Start & Requirements

This repository is a curated list and does not provide a single installable project. Users should refer to the individual projects and papers listed for their specific installation instructions, dependencies (e.g., specific GPU/CUDA versions, Python versions), and setup requirements.

Highlighted Details

  • DeepSpeed & ColossalAI: Prominent unified deep learning systems offering comprehensive optimizations for large-scale distributed training and inference, including memory efficiency (ZeRO) and various parallelism strategies.
  • Megatron-LM & Alpa: Frameworks focusing on training massive transformer models at scale, with Megatron-LM emphasizing ongoing research and Alpa specializing in automated inter- and intra-operator parallelism.
  • Mixture-of-Experts (MoE) Systems: A rich collection of projects like GShard, Tutel, and MegaBlocks, addressing the challenges of conditional computation and sparse activation for extremely large models.
  • Memory Efficiency Techniques: Resources cover methods such as ZeRO, Checkmate, and activation compression (ActNN, GACT) that reduce memory footprints and enable training of trillion-parameter models; the sketch after this list illustrates the activation-recomputation idea behind tools like Checkmate.
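
A small, hedged illustration of the recomputation idea referenced in the last bullet: PyTorch's standard torch.utils.checkpoint utility drops intermediate activations during the forward pass and recomputes them during backward, trading compute for memory. The module sizes and batch shape below are illustrative assumptions; planners such as Checkmate automate the choice of what to recompute, which is done by hand here.

```python
# Minimal sketch of activation recomputation (gradient checkpointing).
# Activations inside `block` are not stored in the forward pass and are
# recomputed during backward, reducing the activation memory footprint.
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))
x = torch.randn(32, 512, requires_grad=True)

y = checkpoint(block, x, use_reentrant=False)  # forward without saving intermediates
y.sum().backward()                             # intermediates are recomputed here
print(x.grad.shape)                            # gradients flow as usual
```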

Maintenance & Community

As a curated list, the repository itself sees minimal direct maintenance and relies on community contributions to stay current. The listed projects (e.g., DeepSpeed, Megatron-LM, ColossalAI) often have active communities, dedicated GitHub repositories, and associated research publications.

Licensing & Compatibility

The repository itself does not specify a license. The licensing and compatibility for commercial use or closed-source linking depend entirely on the individual open-source projects and research papers referenced within the list. Users must consult the licenses of each specific tool or framework.

Limitations & Caveats

This resource is a collection of pointers to other projects and papers, not a unified framework itself. The field of distributed ML is rapidly evolving, meaning the list may not always reflect the absolute latest advancements. Users must evaluate each referenced project independently for its maturity, stability, and suitability.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

6 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Jiayi Pan (author of SWE-Gym; MTS at xAI), and 20 more.

alpa by alpa-projects

Auto-parallelization framework for large-scale neural network training and serving

Created 4 years ago
Updated 2 years ago
3k stars

0.1% on SourcePulse