awesome-distributed-ml by Shenggan

Comprehensive guide to distributed AI training and inference

Created 3 years ago
253 stars

Top 99.3% on SourcePulse

Project Summary

This repository serves as a curated catalog of open-source projects and research papers on distributed machine learning, particularly the training and inference of large models. It targets engineers, researchers, and practitioners who need to scale deep learning workloads beyond a single machine, offering a structured overview of state-of-the-art systems and techniques. Its primary benefit is a centralized, categorized resource for navigating the complex landscape of distributed ML.

How It Works

The repository categorizes distributed ML approaches into key areas, detailing relevant projects and papers for each. It covers fundamental parallelism strategies such as pipeline, sequence, and mixture-of-experts (MoE) parallelism, alongside hybrid frameworks that combine multiple techniques. Further sections address critical system-level concerns including memory-efficient training, auto-parallelization, communication optimization, and fault-tolerant training mechanisms. This structured approach allows users to explore diverse architectural choices and algorithmic innovations for optimizing distributed training and inference.
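
The parallelism ideas above are easiest to see in toy form. The sketch below simulates Megatron-style tensor (intra-operator) parallelism in plain NumPy on a single process: the weight matrix of a linear layer is split column-wise across hypothetical ranks, each rank computes a partial output, and a gather reassembles the full activation. All names, shapes, and the two-way split are illustrative assumptions, not APIs from any project listed here.

```python
# Toy single-process simulation of tensor (intra-operator) parallelism:
# split a linear layer's weight column-wise across "ranks", compute
# partial outputs, then gather. Real systems do this across devices
# with collective communication; here NumPy stands in for the runtime.
import numpy as np

rng = np.random.default_rng(0)
batch, d_in, d_out, world_size = 4, 8, 16, 2  # illustrative sizes

x = rng.standard_normal((batch, d_in))   # activations, replicated on every rank
w = rng.standard_normal((d_in, d_out))   # full weight of one linear layer

# Column-parallel split: each rank owns a slice of the output features.
w_shards = np.split(w, world_size, axis=1)

# Each rank computes its partial output; an all-gather (concatenate here)
# reassembles the full activation.
partials = [x @ w_k for w_k in w_shards]
y_parallel = np.concatenate(partials, axis=1)

assert np.allclose(y_parallel, x @ w)  # matches the unsharded computation
```

Pipeline parallelism instead splits the model by layer across ranks, and MoE parallelism routes tokens to expert shards; the hybrid frameworks listed in this section combine these dimensions into a single schedule.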

Quick Start & Requirements

This repository is a curated list and does not provide a single installable project. Users should refer to the individual projects and papers listed for their specific installation instructions, dependencies (e.g., specific GPU/CUDA versions, Python versions), and setup requirements.

Highlighted Details

  • DeepSpeed & ColossalAI: Prominent unified deep learning systems offering comprehensive optimizations for large-scale distributed training and inference, including memory efficiency (ZeRO) and various parallelism strategies.
  • Megatron-LM & Alpa: Frameworks focusing on training massive transformer models at scale, with Megatron-LM emphasizing ongoing research and Alpa specializing in automated inter- and intra-operator parallelism.
  • Mixture-of-Experts (MoE) Systems: A rich collection of projects like GShard, Tutel, and MegaBlocks, addressing the challenges of conditional computation and sparse activation for extremely large models.
  • Memory Efficiency Techniques: Resources cover methods such as ZeRO, Checkmate, and activation compression (ActNN, GACT) that reduce memory footprints and enable training of trillion-parameter models; the sketch after this list illustrates the activation-recomputation idea behind tools like Checkmate.
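
A small, hedged illustration of the recomputation idea referenced in the last bullet: PyTorch's standard torch.utils.checkpoint utility drops intermediate activations during the forward pass and recomputes them during backward, trading compute for memory. The module sizes and batch shape below are illustrative assumptions; planners such as Checkmate automate the choice of what to recompute, which is done by hand here.

```python
# Minimal sketch of activation recomputation (gradient checkpointing).
# Activations inside `block` are not stored in the forward pass and are
# recomputed during backward, reducing the activation memory footprint.
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))
x = torch.randn(32, 512, requires_grad=True)

y = checkpoint(block, x, use_reentrant=False)  # forward without saving intermediates
y.sum().backward()                             # intermediates are recomputed here
print(x.grad.shape)                            # gradients flow as usual
```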

Maintenance & Community

As a curated list, the repository itself sees minimal direct maintenance and relies on community contributions to stay current. The listed projects (e.g., DeepSpeed, Megatron-LM, ColossalAI) often have active communities, dedicated GitHub repositories, and associated research publications.

Licensing & Compatibility

The repository itself does not specify a license. The licensing and compatibility for commercial use or closed-source linking depend entirely on the individual open-source projects and research papers referenced within the list. Users must consult the licenses of each specific tool or framework.

Limitations & Caveats

This resource is a collection of pointers to other projects and papers, not a unified framework itself. The field of distributed ML is rapidly evolving, meaning the list may not always reflect the absolute latest advancements. Users must evaluate each referenced project independently for its maturity, stability, and suitability.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

6 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Jiayi Pan (author of SWE-Gym; MTS at xAI), and 20 more.

alpa by alpa-projects

Auto-parallelization framework for large-scale neural network training and serving

Created 4 years ago
Updated 2 years ago
3k stars

0.1% on SourcePulse