DisTrO by NousResearch

Distributed optimizers research paper

Created 1 year ago
957 stars

Top 38.4% on SourcePulse

View on GitHub
Project Summary

DisTrO is a framework of low-latency distributed optimizers designed to drastically reduce inter-GPU communication overhead in large-scale model training. It targets researchers and engineers building distributed deep-learning systems who need to cut communication costs.

How It Works

DisTrO implements a family of optimizers that reduce inter-GPU communication requirements by three to four orders of magnitude. The core innovation lies in minimizing the data exchanged between GPUs at each synchronization step, enabling efficient distributed training even over ordinary internet connections.
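The README does not include code or algorithmic details, but the scale of the claim can be illustrated with a generic compression scheme. The sketch below is a minimal, single-process PyTorch illustration of top-k gradient compression with local error feedback, a common family of techniques for cutting synchronization traffic; it is an assumption-laden stand-in, not DisTrO's actual method, and all function names are hypothetical.

    import torch

    def compress_for_exchange(grad: torch.Tensor, k: int):
        """Keep only the k largest-magnitude entries; the rest stays
        behind in a local residual (error-feedback) buffer."""
        flat = grad.flatten()
        _, idx = torch.topk(flat.abs(), k)
        return idx, flat[idx]

    def decompress(idx: torch.Tensor, values: torch.Tensor, shape: torch.Size) -> torch.Tensor:
        out = torch.zeros(shape).flatten()
        out[idx] = values
        return out.view(shape)

    # Toy loop: a 1M-value "gradient" is reduced to 1k exchanged values per
    # step, i.e. a 1000x (three orders of magnitude) cut in communicated data.
    grad_like = torch.randn(1_000_000)
    residual = torch.zeros_like(grad_like)

    for step in range(3):
        g = torch.randn_like(grad_like) + residual     # fold leftover signal back in
        idx, vals = compress_for_exchange(g, k=1_000)  # the only data a worker would send
        residual = g - decompress(idx, vals, g.shape)  # keep the uncommunicated remainder
        print(f"step {step}: exchanged {vals.numel():,} of {g.numel():,} values")

In a real multi-GPU setting, the (idx, vals) pairs would be exchanged across workers in place of a full gradient all-reduce; everything else above stays local.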

Quick Start & Requirements

  • Installation: Not specified in README.
  • Prerequisites: Not specified in README.
  • Resources: Not specified in README.
  • Links:
    • Preliminary Report (Aug. 26th, 2024)
    • DeMo Optimization Paper (Dec. 2nd, 2024)
    • DeMo Optimization Code (Dec. 2nd, 2024)

Highlighted Details

  • Achieves a 3-4 order-of-magnitude reduction in inter-GPU communication.
  • Demonstrated training of a 15B-parameter model using DisTrO (see the back-of-envelope sketch after this list).
  • Related projects include Psyche Network and the Nous Consilience 40B LLM.
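To make the headline numbers concrete, a rough back-of-envelope calculation helps, assuming (as the README does not specify) that a full synchronization exchanges one fp16 value (2 bytes) per parameter:

    # Hypothetical back-of-envelope; fp16 precision is an assumption.
    params = 15e9                      # the demonstrated 15B-parameter model
    full_sync_gb = params * 2 / 1e9    # ~30 GB per full gradient synchronization
    for factor in (1e3, 1e4):          # three and four orders of magnitude
        print(f"{factor:.0e}x reduction: {full_sync_gb * 1e3 / factor:.1f} MB per sync")

Under these assumptions, a roughly 30 GB full sync shrinks to about 30 MB at 1,000x and about 3 MB at 10,000x, which is what makes training over ordinary internet links plausible.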

Maintenance & Community

  • Community: Discord server available for collaboration.
  • Roadmap: Upcoming paper and code release.

Licensing & Compatibility

  • License: Not specified in README.
  • Compatibility: Not specified in README.

Limitations & Caveats

The project is presented as preliminary, with a formal paper and code release pending. Specific installation, requirements, and compatibility details are not yet available.

Health Check

  • Last Commit: 4 months ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 7 stars in the last 30 days

Explore Similar Projects

Starred by Clement Delangue (Cofounder of Hugging Face), Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems"), and 20 more.

accelerate by huggingface

  • Top 0.3% on SourcePulse
  • 9k stars
  • PyTorch training helper for distributed execution
  • Created 4 years ago
  • Updated 1 day ago