OpenDiloco by PrimeIntellect-ai

Framework for globally distributed low-communication training

Created 1 year ago
531 stars

Top 59.8% on SourcePulse

View on GitHub
Project Summary

OpenDiLoCo provides an open-source framework for globally distributed, low-communication model training, targeting researchers and practitioners of large-scale deep learning. It aims to reduce communication overhead in distributed training by employing techniques that allow workers to perform more local computation before synchronizing, enabling efficient training across geographically dispersed or bandwidth-constrained environments.

How It Works

The framework leverages the Hivemind library for decentralized weight averaging and a distributed hash table (DHT) for peer discovery and coordination, and it integrates with PyTorch's Fully Sharded Data Parallel (FSDP) for efficient model sharding. The core idea is the "DiLoCo" (Distributed Low-Communication) approach: each worker performs a configurable number of local training steps (hv.local-steps) before synchronizing with its peers, so parameters are exchanged only at these infrequent synchronization points, minimizing inter-node communication.
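
As a rough illustration of the approach (a minimal single-process sketch with assumed hyperparameters, not OpenDiLoCo's actual API), each worker runs H inner steps on its own data, and only the resulting parameter deltas ("pseudo-gradients") are averaged and applied by an outer optimizer:

```python
# Minimal single-process sketch of the DiLoCo pattern. The model, data,
# hyperparameters, and two-worker loop are illustrative assumptions; OpenDiLoCo
# itself averages over Hivemind peers rather than in-process replicas.
import copy
import torch

H = 50  # local steps between synchronizations (cf. the hv.local-steps setting)
torch.manual_seed(0)

global_model = torch.nn.Linear(10, 1)
outer_opt = torch.optim.SGD(global_model.parameters(), lr=0.7,
                            momentum=0.9, nesterov=True)

workers = [copy.deepcopy(global_model) for _ in range(2)]
inner_opts = [torch.optim.AdamW(w.parameters(), lr=1e-3) for w in workers]

for outer_step in range(10):
    # Inner phase: every worker trains independently with zero communication.
    for worker, opt in zip(workers, inner_opts):
        for _ in range(H):
            x = torch.randn(32, 10)
            y = x.sum(dim=1, keepdim=True)
            loss = torch.nn.functional.mse_loss(worker(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()

    # Outer phase: average the pseudo-gradients (old global weights minus the
    # locally updated weights) and apply them with the outer optimizer.
    for p_global, *p_local in zip(global_model.parameters(),
                                  *[w.parameters() for w in workers]):
        p_global.grad = torch.stack(
            [p_global.data - p.data for p in p_local]).mean(dim=0)
    outer_opt.step()
    outer_opt.zero_grad()

    # Broadcast the updated global weights back to every worker.
    for worker in workers:
        worker.load_state_dict(global_model.state_dict())
```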

Quick Start & Requirements

  • Installation: Clone the repository with submodules (git clone --recursive) and install dependencies with "pip install ." from the repository root. A Docker image (primeintellect/open_diloco:main) is also available.
  • Prerequisites: Python 3.11, PyTorch (nightly CPU build recommended for setup), optional CUDA toolkit for Flash Attention 2.
  • Setup: Requires setting up a Hivemind DHT instance for distributed runs (see the sketch after this list).
  • Docs: https://github.com/PrimeIntellect-ai/OpenDiloco
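
As a rough illustration of the DHT setup step, a first node can start a Hivemind DHT and print the multiaddresses that other peers use to join (a minimal sketch assuming the hivemind package is installed; OpenDiLoCo's own launch scripts may wrap this differently):

```python
# Minimal Hivemind DHT bootstrap sketch (assumes the hivemind package is installed).
# OpenDiLoCo's launch scripts may configure hosts, ports, and peers differently.
import hivemind

# First peer: start a fresh DHT and advertise its addresses.
dht = hivemind.DHT(start=True)
initial_peers = [str(addr) for addr in dht.get_visible_maddrs()]
print("Other peers can join with initial_peers =", initial_peers)

# A subsequent peer would connect to the addresses printed above:
# peer_dht = hivemind.DHT(initial_peers=initial_peers, start=True)
```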

Highlighted Details

  • Supports distributed training with reduced communication via local steps.
  • Integrates Hivemind for decentralized weight averaging and DHT for peer discovery.
  • Compatible with PyTorch FSDP for model sharding.
  • Includes example configurations for training Llama models (150M and 1B parameters).

Maintenance & Community

The project explicitly states it is "no longer maintained." Users are directed to a successor project, "prime," for production-ready solutions.

Licensing & Compatibility

The repository does not explicitly state a license in the README.

Limitations & Caveats

The project is explicitly marked as "no longer maintained." The Hivemind implementation may not handle Ctrl+C gracefully, requiring pkill to stop runs. Gradient scaling requires manual unscaling and passing the scaler to optimizer.step().
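
For context, the scaler caveat maps onto the standard PyTorch AMP pattern shown below; per the caveat, OpenDiLoCo's Hivemind-wrapped optimizer additionally expects gradients to be unscaled manually and the scaler to be passed into optimizer.step() (the exact wrapper signature is not reproduced here):

```python
# Standard PyTorch AMP loop shown for context only (CUDA assumed). Per the
# caveat above, OpenDiLoCo's wrapped optimizer instead requires manual
# unscaling and takes the scaler in optimizer.step(...); signature not shown.
import torch

device = "cuda"
model = torch.nn.Linear(10, 1).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

for _ in range(10):
    x = torch.randn(32, 10, device=device)
    y = x.sum(dim=1, keepdim=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = torch.nn.functional.mse_loss(model(x), y)

    optimizer.zero_grad()
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)   # manual unscaling before the optimizer step
    scaler.step(optimizer)       # standard API; the wrapped optimizer differs (see caveat)
    scaler.update()
```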

Health Check

  • Last Commit: 8 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 6 stars in the last 30 days

Explore Similar Projects

Starred by Théophile Gervet (Cofounder of Genesis AI), Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), and 6 more.

lingua by facebookresearch
0.1% · 5k stars
LLM research codebase for training and inference
Created 11 months ago · Updated 2 months ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Jiayi Pan (Author of SWE-Gym; MTS at xAI), and 20 more.

alpa by alpa-projects
0.0% · 3k stars
Auto-parallelization framework for large-scale neural network training and serving
Created 4 years ago · Updated 1 year ago
Starred by Clement Delangue (Cofounder of Hugging Face), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 20 more.

accelerate by huggingface
0.3% · 9k stars
PyTorch training helper for distributed execution
Created 4 years ago · Updated 1 day ago
Starred by Tobi Lutke (Cofounder of Shopify), Li Jiang (Coauthor of AutoGen; Engineer at Microsoft), and 26 more.

ColossalAI by hpcaitech
0.1% · 41k stars
AI system for large-scale parallel training
Created 3 years ago · Updated 13 hours ago