OpenDiloco by PrimeIntellect-ai

Framework for globally distributed low-communication training

created 1 year ago
520 stars

Top 61.4% on sourcepulse

View on GitHub
Project Summary

OpenDiLoCo provides an open-source framework for globally distributed, low-communication model training, targeting researchers and practitioners of large-scale deep learning. It aims to reduce communication overhead in distributed training by employing techniques that allow workers to perform more local computation before synchronizing, enabling efficient training across geographically dispersed or bandwidth-constrained environments.

How It Works

The framework leverages the Hivemind library for decentralized weight averaging and distributed hash tables (DHT) for peer discovery and coordination, and integrates with PyTorch's Fully Sharded Data Parallel (FSDP) for efficient model sharding. The core innovation is the DiLoCo (Distributed Low-Communication) approach: each worker runs a configurable number of local inner-optimizer steps (hv.local_steps) between synchronization points, at which the workers' accumulated weight deltas ("pseudo-gradients") are averaged and applied by an outer optimizer, thereby minimizing inter-node communication.
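
The two-level structure is easiest to see in code. Below is a minimal single-process sketch of the DiLoCo pattern, assuming AdamW as the inner optimizer and Nesterov-momentum SGD as the outer optimizer; all names are illustrative, and the real implementation replaces the single-worker pseudo-gradient with a Hivemind average across peers.

    import torch

    # Minimal single-process sketch of the DiLoCo inner/outer loop.
    # Names are illustrative; OpenDiLoCo wires this pattern through
    # Hivemind and FSDP rather than plain PyTorch.
    model = torch.nn.Linear(16, 4)          # stand-in for a real LLM
    inner_opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

    # The outer optimizer acts on a separate copy of the "global" weights.
    global_params = [p.detach().clone() for p in model.parameters()]
    outer_opt = torch.optim.SGD(global_params, lr=0.7, momentum=0.9, nesterov=True)

    local_steps = 500                       # plays the role of hv.local_steps

    for outer_step in range(10):
        # Inner loop: ordinary local training, no inter-node communication.
        for _ in range(local_steps):
            x = torch.randn(8, 16)
            loss = model(x).pow(2).mean()
            inner_opt.zero_grad()
            loss.backward()
            inner_opt.step()

        # Outer step: the pseudo-gradient is the drift of the local weights
        # away from the global weights. In the real system this delta is
        # averaged across workers via Hivemind; one worker uses it directly.
        for g, p in zip(global_params, model.parameters()):
            g.grad = g.data - p.data
        outer_opt.step()

        # Reset local weights to the updated global weights.
        with torch.no_grad():
            for g, p in zip(global_params, model.parameters()):
                p.copy_(g)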

Quick Start & Requirements

  • Installation: Clone the repository with submodules (git clone --recursive) and install dependencies with pip install . (a Docker image, primeintellect/open_diloco:main, is also available).
  • Prerequisites: Python 3.11, PyTorch (nightly CPU build recommended for setup), optional CUDA toolkit for Flash Attention 2.
  • Setup: Requires a running Hivemind DHT instance for distributed runs (a minimal bootstrap sketch follows this list).
  • Docs: https://github.com/PrimeIntellect-ai/OpenDiloco
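
The DHT setup step can be sketched as follows: the snippet starts a Hivemind bootstrap node and prints the multiaddresses that other workers would pass as their initial peers. hivemind.DHT and get_visible_maddrs() are standard Hivemind APIs; the exact OpenDiLoCo flag that consumes the addresses is an assumption here.

    import hivemind

    # Start a DHT bootstrap node for other peers to join. hivemind.DHT
    # and get_visible_maddrs() are part of Hivemind's public API.
    dht = hivemind.DHT(start=True)

    # Workers join the swarm by passing one of these multiaddresses as
    # their initial peer (the exact OpenDiLoCo flag name, e.g. under the
    # hv.* config namespace, is an assumption here).
    for maddr in dht.get_visible_maddrs():
        print(maddr)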

Highlighted Details

  • Supports distributed training with reduced communication via local steps.
  • Integrates Hivemind for decentralized weight averaging and DHT for peer discovery.
  • Compatible with PyTorch FSDP for model sharding.
  • Includes example configurations for training Llama models (150M and 1B parameters).

Maintenance & Community

The project explicitly states it is "no longer maintained." Users are directed to a successor project, "prime," for production-ready solutions.

Licensing & Compatibility

The repository does not explicitly state a license in the README.

Limitations & Caveats

The project is explicitly marked as "no longer maintained." The Hivemind implementation may not handle Ctrl+C gracefully, requiring pkill to stop runs. Gradient scaling requires manual unscaling and passing the scaler to optimizer.step().
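
A minimal illustration of that gradient-scaling caveat, assuming a CUDA device: a plain torch optimizer stands in for OpenDiLoCo's Hivemind-wrapped optimizer, whose step() additionally takes the scaler per the README.

    import torch

    model = torch.nn.Linear(16, 4).cuda()
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    scaler = torch.cuda.amp.GradScaler()

    x = torch.randn(8, 16, device="cuda")
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(x).pow(2).mean()

    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)    # manual unscaling, as the README requires
    optimizer.step()              # with the Hivemind wrapper: optimizer.step(scaler)
    scaler.update()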

Health Check

  • Last commit: 6 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
Star History

29 stars in the last 90 days

Explore Similar Projects

Starred by Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake) and Zhiqiang Xie (Author of SGLang).

veScale by volcengine

Top 0.1% on sourcepulse
839 stars
PyTorch-native framework for LLM training
created 1 year ago
updated 3 weeks ago
Starred by Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake) and Travis Fischer (Founder of Agentic).

lingua by facebookresearch

Top 0.1% on sourcepulse
5k stars
LLM research codebase for training and inference
created 9 months ago
updated 2 weeks ago