DisTrO by NousResearch

Distributed optimizers research paper

Created 1 year ago
957 stars

Top 38.4% on SourcePulse

View on GitHub
Project Summary

DisTrO is a framework of low-latency distributed optimizers designed to drastically reduce inter-GPU communication overhead in large-scale model training. It targets researchers and engineers building distributed deep-learning systems who need to cut communication costs.

How It Works

DisTrO implements a family of optimizers that reduce inter-GPU communication requirements by three to four orders of magnitude. The core innovation lies in minimizing the data exchanged between GPUs at each synchronization step, enabling efficient distributed training even over ordinary internet connections.
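The README does not include code or algorithmic details, but the scale of the claim can be illustrated with a generic compression scheme. The sketch below is a minimal, single-process PyTorch illustration of top-k gradient compression with local error feedback, a common family of techniques for cutting synchronization traffic; it is an assumption-laden stand-in, not DisTrO's actual method, and all function names are hypothetical.

    import torch

    def compress_for_exchange(grad: torch.Tensor, k: int):
        """Keep only the k largest-magnitude entries; the rest stays
        behind in a local residual (error-feedback) buffer."""
        flat = grad.flatten()
        _, idx = torch.topk(flat.abs(), k)
        return idx, flat[idx]

    def decompress(idx: torch.Tensor, values: torch.Tensor, shape: torch.Size) -> torch.Tensor:
        out = torch.zeros(shape).flatten()
        out[idx] = values
        return out.view(shape)

    # Toy loop: a 1M-value "gradient" is reduced to 1k exchanged values per
    # step, i.e. a 1000x (three orders of magnitude) cut in communicated data.
    grad_like = torch.randn(1_000_000)
    residual = torch.zeros_like(grad_like)

    for step in range(3):
        g = torch.randn_like(grad_like) + residual     # fold leftover signal back in
        idx, vals = compress_for_exchange(g, k=1_000)  # the only data a worker would send
        residual = g - decompress(idx, vals, g.shape)  # keep the uncommunicated remainder
        print(f"step {step}: exchanged {vals.numel():,} of {g.numel():,} values")

In a real multi-GPU setting, the (idx, vals) pairs would be exchanged across workers in place of a full gradient all-reduce; everything else above stays local.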

Quick Start & Requirements

  • Installation: Not specified in README.
  • Prerequisites: Not specified in README.
  • Resources: Not specified in README.
  • Links:
    • Preliminary Report (Aug. 26th, 2024)
    • DeMo Optimization Paper (Dec. 2nd, 2024)
    • DeMo Optimization Code (Dec. 2nd, 2024)

Highlighted Details

  • Achieves a 3-4 order-of-magnitude reduction in inter-GPU communication.
  • Demonstrated training of a 15B-parameter model using DisTrO (see the back-of-envelope sketch after this list).
  • Related projects include Psyche Network and the Nous Consilience 40B LLM.
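To make the headline numbers concrete, a rough back-of-envelope calculation helps, assuming (as the README does not specify) that a full synchronization exchanges one fp16 value (2 bytes) per parameter:

    # Hypothetical back-of-envelope; fp16 precision is an assumption.
    params = 15e9                      # the demonstrated 15B-parameter model
    full_sync_gb = params * 2 / 1e9    # ~30 GB per full gradient synchronization
    for factor in (1e3, 1e4):          # three and four orders of magnitude
        print(f"{factor:.0e}x reduction: {full_sync_gb * 1e3 / factor:.1f} MB per sync")

Under these assumptions, a roughly 30 GB full sync shrinks to about 30 MB at 1,000x and about 3 MB at 10,000x, which is what makes training over ordinary internet links plausible.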

Maintenance & Community

  • Community: Discord server available for collaboration.
  • Roadmap: Upcoming paper and code release.

Licensing & Compatibility

  • License: Not specified in README.
  • Compatibility: Not specified in README.

Limitations & Caveats

The project is presented as preliminary, with a formal paper and code release pending. Specific installation, requirements, and compatibility details are not yet available.

Health Check

  • Last Commit: 4 months ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 7 stars in the last 30 days

Explore Similar Projects

Starred by Clement Delangue (Cofounder of Hugging Face), Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems"), and 20 more.

accelerate by huggingface

  • Top 0.3% on SourcePulse
  • 9k stars
  • PyTorch training helper for distributed execution
  • Created 4 years ago
  • Updated 1 day ago