Framework for distributed AI model training over the internet
Prime is a framework for efficient, globally distributed AI model training over the internet, designed for researchers and engineers tackling large-scale distributed training challenges. It addresses the complexities of fault tolerance, checkpointing, and communication overhead in decentralized environments, enabling more robust and scalable training.
How It Works
Prime introduces ElasticDeviceMesh for fault-tolerant, dynamic process group management across the internet, using heartbeats to detect and remove dead nodes without crashing. It implements asynchronous distributed checkpointing by first saving to RAM (/dev/shm) for speed, then copying to disk and remote storage asynchronously. A custom C++ Int8 All-Reduce kernel is provided for 4x payload reduction, with optimized uint8 quantization/dequantization ops for high bandwidth utilization. It also leverages PyTorch FSDP2/DTensor for ZeRO-3 sharding and CPU offloading for optimizer states.
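The two-stage checkpointing idea described above (a fast blocking write to RAM-backed storage, then a background copy to durable storage) can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not Prime's actual implementation: the function names save_checkpoint_async and pick_fast_dir, the pickle file format, and the step_N.pkl naming are all assumptions made for the example, and the remote-storage upload step is left out.

```python
import os
import pickle
import shutil
import tempfile
import threading
from pathlib import Path


def pick_fast_dir() -> Path:
    """Return a fresh directory on /dev/shm if it is writable, else the system temp dir."""
    shm = Path("/dev/shm")
    base = shm if shm.is_dir() and os.access(shm, os.W_OK) else Path(tempfile.gettempdir())
    return Path(tempfile.mkdtemp(prefix="ckpt_ram_", dir=base))


def save_checkpoint_async(state: dict, step: int, fast_dir: Path, disk_dir: Path) -> threading.Thread:
    """Write `state` to fast storage, then copy it to disk in a background thread.

    Hypothetical helper for illustration; names and file layout are assumptions.
    """
    fast_dir.mkdir(parents=True, exist_ok=True)
    disk_dir.mkdir(parents=True, exist_ok=True)
    name = f"step_{step}.pkl"
    fast_path = fast_dir / name

    # Stage 1 (blocking, but fast): serialize into RAM-backed storage.
    with open(fast_path, "wb") as f:
        pickle.dump(state, f)

    def _copy_to_disk() -> None:
        # Stage 2 (background): copy under a temporary name, then rename
        # atomically so a reader never sees a half-written checkpoint.
        tmp_path = disk_dir / (name + ".tmp")
        shutil.copyfile(fast_path, tmp_path)
        os.replace(tmp_path, disk_dir / name)

    worker = threading.Thread(target=_copy_to_disk, daemon=True)
    worker.start()
    # The caller can resume training immediately and join() the thread
    # before relying on the on-disk copy.
    return worker
```

In a training loop, the caller would invoke save_checkpoint_async(state, step, fast_dir, disk_dir) and continue with the next step, calling join() on the returned thread only when the durable copy is needed, for example before exiting. The point of the split is that the training process blocks only for the in-memory write, while the slower disk copy overlaps with computation.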
Quick Start & Requirements
Install with the following command, followed by uv sync --extra all:
curl -sSL https://raw.githubusercontent.com/PrimeIntellect-ai/prime/main/scripts/install/install.sh | bash
Requires the uv package manager, git, iperf, a Hugging Face CLI login, and potentially downloading datasets (scripts/subset_data.py).
Highlighted Details
Maintenance & Community
The project is actively developed by PrimeIntellect-ai. Further community and roadmap details are not explicitly provided in the README.
Licensing & Compatibility
The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The framework relies on specific environment variables for distributed setup and may require careful configuration of network settings (e.g., VPNs) for optimal performance. The setup process involves multiple steps and external scripts, indicating a potentially steep learning curve.