Framework for globally distributed low-communication training
OpenDiLoCo provides an open-source framework for globally distributed, low-communication model training, targeting researchers and practitioners of large-scale deep learning. It aims to reduce communication overhead in distributed training by employing techniques that allow workers to perform more local computation before synchronizing, enabling efficient training across geographically dispersed or bandwidth-constrained environments.
How It Works
The framework leverages the Hivemind library for decentralized weight averaging and its distributed hash table (DHT) for peer discovery and coordination. It integrates with PyTorch's Fully Sharded Data Parallel (FSDP) for efficient model sharding. The core innovation lies in its DiLoCo (Distributed Low-Communication) approach, which allows workers to perform a configurable number of local training steps (`hv.local-steps`) before synchronizing gradients or parameters, thereby minimizing inter-node communication.
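The sketch below illustrates the inner/outer loop that such a local-steps scheme implies, using plain torch.distributed instead of the Hivemind/FSDP stack the framework actually uses. The function name, model, data loader, and hyperparameters (local_steps, outer_lr, the AdamW/SGD settings) are illustrative assumptions rather than OpenDiLoCo's defaults; `local_steps` plays the role the `hv.local-steps` setting describes.

```python
# Minimal DiLoCo-style inner/outer loop sketch using plain torch.distributed.
# The real framework relies on Hivemind (DHT-based averaging) and FSDP; all
# names and hyperparameters here are illustrative assumptions.
import torch
import torch.distributed as dist


def diloco_train(model, data_loader, local_steps=500, outer_lr=0.7, total_outer_steps=100):
    # One process per (possibly geographically distant) worker; expects the
    # usual RANK / WORLD_SIZE / MASTER_ADDR environment variables.
    dist.init_process_group("gloo")

    # Inner optimizer: ordinary local training, no communication.
    inner_opt = torch.optim.AdamW(model.parameters(), lr=4e-4)

    # Outer optimizer acts on a copy of the globally shared parameters.
    global_params = [p.detach().clone() for p in model.parameters()]
    outer_opt = torch.optim.SGD(global_params, lr=outer_lr, momentum=0.9, nesterov=True)

    data_iter = iter(data_loader)
    for _ in range(total_outer_steps):
        # --- inner phase: purely local computation ---
        for _ in range(local_steps):
            batch, target = next(data_iter)
            loss = torch.nn.functional.cross_entropy(model(batch), target)
            loss.backward()
            inner_opt.step()
            inner_opt.zero_grad()

        # --- outer phase: one communication round per `local_steps` steps ---
        world_size = dist.get_world_size()
        for gp, p in zip(global_params, model.parameters()):
            delta = gp.data - p.data                 # this worker's pseudo-gradient
            dist.all_reduce(delta, op=dist.ReduceOp.SUM)
            gp.grad = delta / world_size             # averaged across workers
        outer_opt.step()
        outer_opt.zero_grad()

        # Every worker resumes local training from the updated global parameters.
        with torch.no_grad():
            for gp, p in zip(global_params, model.parameters()):
                p.copy_(gp)
```

With this structure, network traffic scales with the number of outer steps rather than the number of gradient updates, which is what makes training across bandwidth-constrained sites feasible.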
Quick Start & Requirements
Clone the repository with its submodules (`git clone --recursive`) and install dependencies via `pip install .`. A Docker image (`primeintellect/open_diloco:main`) is also available.
Highlighted Details
Maintenance & Community
The project explicitly states it is "no longer maintained." Users are directed to a successor project, "prime," for production-ready solutions.
Licensing & Compatibility
The repository does not explicitly state a license in the README.
Limitations & Caveats
The project is explicitly marked as no longer maintained. The Hivemind implementation may not handle `Ctrl+C` gracefully, requiring `pkill` to stop runs. Gradient scaling requires manually unscaling the gradients and passing the scaler to `optimizer.step()`.
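For reference, the loop below shows what manual unscaling looks like with the stock PyTorch GradScaler. It is a generic AMP sketch with placeholder model, optimizer, and data, not OpenDiLoCo's code; note that the standard API calls `scaler.step(optimizer)`, whereas the caveat above describes passing the scaler into the project's own optimizer step.

```python
# Generic AMP training loop with manual gradient unscaling (placeholder model/data).
# OpenDiLoCo's Hivemind-wrapped optimizer instead expects the scaler to be passed
# into optimizer.step(), per the caveat above.
import torch

model = torch.nn.Linear(128, 10).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

for _ in range(10):
    x = torch.randn(32, 128, device="cuda")
    y = torch.randint(0, 10, (32,), device="cuda")
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = torch.nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)                        # manual unscaling, e.g. before grad clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(optimizer)                            # stock API; OpenDiLoCo routes the scaler through its optimizer's step
    scaler.update()
    optimizer.zero_grad()
```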