prime by PrimeIntellect-ai

Framework for distributed AI model training over the internet

Created 1 year ago
820 stars

Top 43.3% on SourcePulse

Project Summary

Prime is a framework for efficient, globally distributed AI model training over the internet, designed for researchers and engineers tackling large-scale distributed training challenges. It addresses the complexities of fault tolerance, checkpointing, and communication overhead in decentralized environments, enabling more robust and scalable training.

How It Works

  • ElasticDeviceMesh: fault-tolerant, dynamic process group management across the internet, using heartbeats to detect and remove dead nodes without crashing the run.
  • Asynchronous distributed checkpointing: checkpoints are first saved to RAM (/dev/shm) for speed, then copied to disk and remote storage asynchronously.
  • Custom C++ Int8 All-Reduce kernel: reduces communication payload 4x, with optimized uint8 quantization/dequantization ops for high bandwidth utilization.
  • PyTorch FSDP2/DTensor: ZeRO-3 sharding and CPU offloading of optimizer states.
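The two-stage checkpointing scheme (fast synchronous write to RAM-backed tmpfs, then an asynchronous copy to persistent storage) can be sketched as follows. This is a minimal illustration only; the function name, file layout, and threading approach are assumptions, not Prime's actual API.

```python
import os
import shutil
import threading

def save_checkpoint(state_bytes: bytes, step: int,
                    shm_dir: str = "/dev/shm/ckpt",   # RAM-backed tmpfs
                    disk_dir: str = "./checkpoints") -> threading.Thread:
    """Two-stage checkpoint: blocking write to RAM, async copy to disk.

    Hypothetical sketch; Prime's real implementation also pushes the
    persisted copy to remote storage.
    """
    os.makedirs(shm_dir, exist_ok=True)
    shm_path = os.path.join(shm_dir, f"step_{step}.pt")
    # Stage 1: synchronous write to tmpfs -- fast, so training resumes quickly.
    with open(shm_path, "wb") as f:
        f.write(state_bytes)

    def _persist():
        # Stage 2: copy to durable storage off the training hot path.
        os.makedirs(disk_dir, exist_ok=True)
        shutil.copy(shm_path, os.path.join(disk_dir, f"step_{step}.pt"))

    t = threading.Thread(target=_persist, daemon=True)
    t.start()
    return t  # caller can join before exiting or taking the next checkpoint
```

The key property is that only the tmpfs write blocks the training loop; durability is achieved in the background.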

Quick Start & Requirements

  • Install: curl -sSL https://raw.githubusercontent.com/PrimeIntellect-ai/prime/main/scripts/install/install.sh | bash followed by uv sync --extra all.
  • Prerequisites: Python 3.x, the uv package manager, git, iperf, a Hugging Face CLI login, and, for some runs, downloaded datasets (scripts/subset_data.py).
  • Setup Time: Varies based on dataset download and environment setup.
  • Docs: https://github.com/PrimeIntellect-ai/prime

Highlighted Details

  • Achieves up to 4Gb/s connections between data centers across the US using VPNs and optimized peer-to-peer connections.
  • Custom Int8 All-Reduce kernel with optimized uint8 ops improves quantization speed by over 60x.
  • Live checkpoint recovery allows nodes to join mid-training within a tight time window.
  • Implements PyTorch FSDP2/DTensor for ZeRO-3 sharding and CPU offloading.
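The payload math behind the int8 all-reduce (quantize float32 gradients to 8 bits before transmission, dequantize on receipt) can be illustrated in plain NumPy. Prime's actual kernel is custom C++; this sketch only shows the symmetric per-tensor quantization idea and the 4x size reduction, with function names that are assumptions.

```python
import numpy as np

def quantize_uint8(x: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor quantization of float32 to 8-bit (4x smaller).

    Hypothetical sketch of the technique, not Prime's kernel.
    """
    scale = float(np.abs(x).max()) / 127.0
    if scale == 0.0:
        scale = 1.0  # avoid division by zero for an all-zero tensor
    q = np.clip(np.rint(x / scale), -127, 127).astype(np.int8)
    return q.view(np.uint8), scale  # uint8 view is the wire payload

def dequantize_uint8(q: np.ndarray, scale: float) -> np.ndarray:
    # Reverse the view and rescale back to float32.
    return q.view(np.int8).astype(np.float32) * scale
```

Each node would send the uint8 payload plus one scale per tensor; the reducer dequantizes, sums, and re-quantizes, trading a small rounding error (bounded by one quantization step) for a 4x bandwidth saving over float32.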

Maintenance & Community

The project is actively developed by PrimeIntellect-ai. Further community and roadmap details are not explicitly provided in the README.

Licensing & Compatibility

The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The framework relies on specific environment variables for distributed setup and may require careful configuration of network settings (e.g., VPNs) for optimal performance. The setup process involves multiple steps and external scripts, indicating a potentially steep learning curve.
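For context, PyTorch distributed jobs are typically configured with per-node environment variables like the following. The exact variables Prime reads are not documented in this summary, so treat this as an illustrative sketch of the kind of setup involved (the address and counts are placeholders):

```shell
# Standard torch.distributed rendezvous variables (illustrative values).
export MASTER_ADDR=10.0.0.1   # hypothetical rendezvous host
export MASTER_PORT=29500      # rendezvous port
export RANK=0                 # this node's global rank
export WORLD_SIZE=4           # total number of participating nodes
```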

Health Check
Last Commit

3 months ago

Responsiveness

1+ week

Pull Requests (30d)
0
Issues (30d)
0
Star History
22 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering", "Designing Machine Learning Systems"), Jiayi Pan (author of SWE-Gym; MTS at xAI), and 20 more.

alpa by alpa-projects

0.0%
3k
Auto-parallelization framework for large-scale neural network training and serving
Created 4 years ago
Updated 1 year ago
Starred by Andrej Karpathy (founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Lewis Tunstall (Research Engineer at Hugging Face), and 13 more.

torchtitan by pytorch

0.7%
4k
PyTorch platform for generative AI model training research
Created 1 year ago
Updated 19 hours ago