prime by PrimeIntellect-ai

Framework for distributed AI model training over the internet

created 10 months ago
783 stars

Top 45.6% on sourcepulse

Project Summary

Prime is a framework for efficient, globally distributed AI model training over the internet, aimed at researchers and engineers working on large-scale distributed training. It addresses fault tolerance, checkpointing, and communication overhead in decentralized environments, so that training can continue when nodes fail and bandwidth between nodes is limited.

How It Works

Prime combines four mechanisms:

  • ElasticDeviceMesh: fault-tolerant, dynamic process group management across the internet. Heartbeats detect dead nodes, which are then removed without crashing the training run.
  • Asynchronous distributed checkpointing: a checkpoint is first saved to RAM (/dev/shm) for speed, then copied to disk and remote storage asynchronously.
  • Custom C++ Int8 All-Reduce kernel: reduces the communication payload by 4x (8-bit values in place of 32-bit floats), with optimized uint8 quantization/dequantization ops for high bandwidth utilization.
  • PyTorch FSDP2/DTensor: used for ZeRO-3 sharding and for CPU offloading of optimizer states.
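
To make the two-stage checkpointing idea concrete, here is a minimal Python sketch. It is not Prime's implementation: the function name `save_checkpoint_async`, the pickle file format, and the directory layout are all assumptions made for illustration. The blocking step writes to a fast directory (on Linux this would be a path under /dev/shm), and a background thread then copies the file to slower durable storage. An upload to remote storage would be one more background step and is not shown.

```python
import pickle
import shutil
import tempfile
import threading
from pathlib import Path


def save_checkpoint_async(state: dict, step: int, fast_dir: Path, durable_dir: Path) -> threading.Thread:
    """Write `state` to fast storage, then copy it to durable storage in the background.

    Hypothetical helper for illustration; not part of Prime's API.
    """
    fast_dir.mkdir(parents=True, exist_ok=True)
    durable_dir.mkdir(parents=True, exist_ok=True)

    name = f"step_{step}.pkl"
    fast_path = fast_dir / name

    # Stage 1 (blocking, but fast): serialize to the RAM-backed directory.
    with open(fast_path, "wb") as f:
        pickle.dump(state, f)

    # Stage 2 (background): copy to durable storage while the caller continues.
    def _copy() -> None:
        tmp_path = durable_dir / (name + ".tmp")
        shutil.copyfile(fast_path, tmp_path)
        # Rename only after the copy is complete, so readers never see a partial file.
        tmp_path.replace(durable_dir / name)

    worker = threading.Thread(target=_copy, daemon=True)
    worker.start()
    return worker


if __name__ == "__main__":
    # Temporary directories stand in for /dev/shm and the checkpoint disk.
    with tempfile.TemporaryDirectory() as root:
        worker = save_checkpoint_async(
            {"step": 100, "weights": [0.1, 0.2]}, 100, Path(root) / "ram", Path(root) / "disk"
        )
        worker.join()  # a real training loop would keep working here instead of waiting
        print(sorted(p.name for p in (Path(root) / "disk").iterdir()))
```

The design point is that the training loop is blocked only for the RAM write; the slower copy overlaps with subsequent training steps.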

Quick Start & Requirements

  • Install: curl -sSL https://raw.githubusercontent.com/PrimeIntellect-ai/prime/main/scripts/install/install.sh | bash followed by uv sync --extra all.
  • Prerequisites: Python 3.x, the uv package manager, git, iperf, and a Hugging Face CLI login. Datasets may also need to be downloaded (scripts/subset_data.py).
  • Setup Time: Varies based on dataset download and environment setup.
  • Docs: https://github.com/PrimeIntellect-ai/prime

Highlighted Details

  • Achieves connections of up to 4 Gb/s between data centers across the US using VPNs and optimized peer-to-peer connections.
  • Custom Int8 All-Reduce kernel with optimized uint8 ops improves quantization speed by over 60x.
  • Live checkpoint recovery allows nodes to join mid-training within a tight time window.
  • Implements PyTorch FSDP2/DTensor for ZeRO-3 sharding and CPU offloading.
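
The 4x payload figure follows from sending one byte per value in place of a 4-byte float. The sketch below illustrates this with a simple min/max affine quantizer in pure Python. It is an assumption-laden illustration, not Prime's C++ kernel: the function names and the per-tensor min/scale scheme are hypothetical, and the all-reduce step itself is not shown.

```python
from array import array


def quantize_uint8(values):
    """Map floats to uint8 codes using a per-tensor affine scheme (illustrative assumption)."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 when all values are equal
    codes = array("B", (min(255, max(0, round((v - lo) / scale))) for v in values))
    return codes, lo, scale


def dequantize_uint8(codes, lo, scale):
    """Recover approximate float32 values from uint8 codes."""
    return array("f", (lo + c * scale for c in codes))


if __name__ == "__main__":
    grads = [-1.0, -0.5, 0.0, 0.25, 1.0]
    codes, lo, scale = quantize_uint8(grads)

    fp32_bytes = len(array("f", grads).tobytes())  # 4 bytes per value
    uint8_bytes = len(codes.tobytes())             # 1 byte per value
    print(f"payload: {fp32_bytes} bytes as float32, {uint8_bytes} bytes as uint8")

    restored = dequantize_uint8(codes, lo, scale)
    worst = max(abs(a - b) for a, b in zip(grads, restored))
    print(f"largest round-trip error: {worst:.5f} (about half a quantization step at most)")
```

The sketch ignores the few extra bytes needed to transmit `lo` and `scale`, which are negligible for large tensors. The trade-off is precision: each value is rounded to one of 256 levels.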

Maintenance & Community

The project is actively developed by PrimeIntellect-ai. Further community and roadmap details are not explicitly provided in the README.

Licensing & Compatibility

The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The framework relies on specific environment variables for distributed setup and may require careful configuration of network settings (e.g., VPNs) for optimal performance. The setup process involves multiple steps and external scripts, indicating a potentially steep learning curve.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 75 stars in the last 90 days
