OpenDiloco by PrimeIntellect-ai

Framework for globally distributed low-communication training

created 1 year ago
520 stars

Top 61.4% on sourcepulse

View on GitHub
Project Summary

OpenDiLoCo provides an open-source framework for globally distributed, low-communication model training, targeting researchers and practitioners of large-scale deep learning. It aims to reduce communication overhead in distributed training by employing techniques that allow workers to perform more local computation before synchronizing, enabling efficient training across geographically dispersed or bandwidth-constrained environments.

How It Works

The framework leverages the Hivemind library for decentralized weight averaging and distributed hash tables (DHT) for peer discovery and coordination, and integrates with PyTorch's Fully Sharded Data Parallel (FSDP) for efficient model sharding. The core innovation is the DiLoCo (Distributed Low-Communication) approach: each worker runs a configurable number of local inner-optimizer steps (hv.local_steps) between synchronization points, at which the workers' accumulated weight deltas ("pseudo-gradients") are averaged and applied by an outer optimizer, thereby minimizing inter-node communication.
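
The two-level structure is easiest to see in code. Below is a minimal single-process sketch of the DiLoCo pattern, assuming AdamW as the inner optimizer and Nesterov-momentum SGD as the outer optimizer; all names are illustrative, and the real implementation replaces the single-worker pseudo-gradient with a Hivemind average across peers.

    import torch

    # Minimal single-process sketch of the DiLoCo inner/outer loop.
    # Names are illustrative; OpenDiLoCo wires this pattern through
    # Hivemind and FSDP rather than plain PyTorch.
    model = torch.nn.Linear(16, 4)          # stand-in for a real LLM
    inner_opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

    # The outer optimizer acts on a separate copy of the "global" weights.
    global_params = [p.detach().clone() for p in model.parameters()]
    outer_opt = torch.optim.SGD(global_params, lr=0.7, momentum=0.9, nesterov=True)

    local_steps = 500                       # plays the role of hv.local_steps

    for outer_step in range(10):
        # Inner loop: ordinary local training, no inter-node communication.
        for _ in range(local_steps):
            x = torch.randn(8, 16)
            loss = model(x).pow(2).mean()
            inner_opt.zero_grad()
            loss.backward()
            inner_opt.step()

        # Outer step: the pseudo-gradient is the drift of the local weights
        # away from the global weights. In the real system this delta is
        # averaged across workers via Hivemind; one worker uses it directly.
        for g, p in zip(global_params, model.parameters()):
            g.grad = g.data - p.data
        outer_opt.step()

        # Reset local weights to the updated global weights.
        with torch.no_grad():
            for g, p in zip(global_params, model.parameters()):
                p.copy_(g)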

Quick Start & Requirements

  • Installation: Clone the repository with submodules (git clone --recursive) and install dependencies with pip install . (a Docker image, primeintellect/open_diloco:main, is also available).
  • Prerequisites: Python 3.11, PyTorch (nightly CPU build recommended for setup), optional CUDA toolkit for Flash Attention 2.
  • Setup: Requires a running Hivemind DHT instance for distributed runs (a minimal bootstrap sketch follows this list).
  • Docs: https://github.com/PrimeIntellect-ai/OpenDiloco
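
The DHT setup step can be sketched as follows: the snippet starts a Hivemind bootstrap node and prints the multiaddresses that other workers would pass as their initial peers. hivemind.DHT and get_visible_maddrs() are standard Hivemind APIs; the exact OpenDiLoCo flag that consumes the addresses is an assumption here.

    import hivemind

    # Start a DHT bootstrap node for other peers to join. hivemind.DHT
    # and get_visible_maddrs() are part of Hivemind's public API.
    dht = hivemind.DHT(start=True)

    # Workers join the swarm by passing one of these multiaddresses as
    # their initial peer (the exact OpenDiLoCo flag name, e.g. under the
    # hv.* config namespace, is an assumption here).
    for maddr in dht.get_visible_maddrs():
        print(maddr)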

Highlighted Details

  • Supports distributed training with reduced communication via local steps.
  • Integrates Hivemind for decentralized weight averaging and DHT for peer discovery.
  • Compatible with PyTorch FSDP for model sharding.
  • Includes example configurations for training Llama models (150M and 1B parameters).

Maintenance & Community

The project explicitly states it is "no longer maintained." Users are directed to a successor project, "prime," for production-ready solutions.

Licensing & Compatibility

The repository does not explicitly state a license in the README.

Limitations & Caveats

The project is explicitly marked as "no longer maintained." The Hivemind implementation may not handle Ctrl+C gracefully, requiring pkill to stop runs. Gradient scaling requires manual unscaling and passing the scaler to optimizer.step().
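
A minimal illustration of that gradient-scaling caveat, assuming a CUDA device: a plain torch optimizer stands in for OpenDiLoCo's Hivemind-wrapped optimizer, whose step() additionally takes the scaler per the README.

    import torch

    model = torch.nn.Linear(16, 4).cuda()
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    scaler = torch.cuda.amp.GradScaler()

    x = torch.randn(8, 16, device="cuda")
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(x).pow(2).mean()

    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)    # manual unscaling, as the README requires
    optimizer.step()              # with the Hivemind wrapper: optimizer.step(scaler)
    scaler.update()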

Health Check

  • Last commit: 6 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
Star History

29 stars in the last 90 days

Explore Similar Projects

Starred by Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake) and Zhiqiang Xie (Author of SGLang).

veScale by volcengine

Top 0.1% on sourcepulse
839 stars
PyTorch-native framework for LLM training
created 1 year ago
updated 3 weeks ago
Starred by Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake) and Travis Fischer (Founder of Agentic).

lingua by facebookresearch

Top 0.1% on sourcepulse
5k stars
LLM research codebase for training and inference
created 9 months ago
updated 2 weeks ago