prime by PrimeIntellect-ai

Framework for distributed AI model training over the internet

Created 1 year ago
820 stars

Top 43.3% on SourcePulse

Project Summary

Prime is a framework for efficient, globally distributed AI model training over the internet, designed for researchers and engineers tackling large-scale distributed training challenges. It addresses the complexities of fault tolerance, checkpointing, and communication overhead in decentralized environments, enabling more robust and scalable training.

How It Works

  • ElasticDeviceMesh: fault-tolerant, dynamic process group management across the internet, using heartbeats to detect and remove dead nodes without crashing the run.
  • Asynchronous distributed checkpointing: checkpoints are first saved to RAM (/dev/shm) for speed, then copied to disk and remote storage asynchronously.
  • Custom C++ Int8 All-Reduce kernel: reduces communication payload 4x, with optimized uint8 quantization/dequantization ops for high bandwidth utilization.
  • PyTorch FSDP2/DTensor: ZeRO-3 sharding and CPU offloading of optimizer states.
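The two-stage checkpointing scheme (fast synchronous write to RAM-backed tmpfs, then an asynchronous copy to persistent storage) can be sketched as follows. This is a minimal illustration only; the function name, file layout, and threading approach are assumptions, not Prime's actual API.

```python
import os
import shutil
import threading

def save_checkpoint(state_bytes: bytes, step: int,
                    shm_dir: str = "/dev/shm/ckpt",   # RAM-backed tmpfs
                    disk_dir: str = "./checkpoints") -> threading.Thread:
    """Two-stage checkpoint: blocking write to RAM, async copy to disk.

    Hypothetical sketch; Prime's real implementation also pushes the
    persisted copy to remote storage.
    """
    os.makedirs(shm_dir, exist_ok=True)
    shm_path = os.path.join(shm_dir, f"step_{step}.pt")
    # Stage 1: synchronous write to tmpfs -- fast, so training resumes quickly.
    with open(shm_path, "wb") as f:
        f.write(state_bytes)

    def _persist():
        # Stage 2: copy to durable storage off the training hot path.
        os.makedirs(disk_dir, exist_ok=True)
        shutil.copy(shm_path, os.path.join(disk_dir, f"step_{step}.pt"))

    t = threading.Thread(target=_persist, daemon=True)
    t.start()
    return t  # caller can join before exiting or taking the next checkpoint
```

The key property is that only the tmpfs write blocks the training loop; durability is achieved in the background.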

Quick Start & Requirements

  • Install: curl -sSL https://raw.githubusercontent.com/PrimeIntellect-ai/prime/main/scripts/install/install.sh | bash followed by uv sync --extra all.
  • Prerequisites: Python 3.x, the uv package manager, git, iperf, a Hugging Face CLI login, and, for some runs, downloaded datasets (scripts/subset_data.py).
  • Setup Time: Varies based on dataset download and environment setup.
  • Docs: https://github.com/PrimeIntellect-ai/prime

Highlighted Details

  • Achieves up to 4Gb/s connections between data centers across the US using VPNs and optimized peer-to-peer connections.
  • Custom Int8 All-Reduce kernel with optimized uint8 ops improves quantization speed by over 60x.
  • Live checkpoint recovery allows nodes to join mid-training within a tight time window.
  • Implements PyTorch FSDP2/DTensor for ZeRO-3 sharding and CPU offloading.
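The payload math behind the int8 all-reduce (quantize float32 gradients to 8 bits before transmission, dequantize on receipt) can be illustrated in plain NumPy. Prime's actual kernel is custom C++; this sketch only shows the symmetric per-tensor quantization idea and the 4x size reduction, with function names that are assumptions.

```python
import numpy as np

def quantize_uint8(x: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor quantization of float32 to 8-bit (4x smaller).

    Hypothetical sketch of the technique, not Prime's kernel.
    """
    scale = float(np.abs(x).max()) / 127.0
    if scale == 0.0:
        scale = 1.0  # avoid division by zero for an all-zero tensor
    q = np.clip(np.rint(x / scale), -127, 127).astype(np.int8)
    return q.view(np.uint8), scale  # uint8 view is the wire payload

def dequantize_uint8(q: np.ndarray, scale: float) -> np.ndarray:
    # Reverse the view and rescale back to float32.
    return q.view(np.int8).astype(np.float32) * scale
```

Each node would send the uint8 payload plus one scale per tensor; the reducer dequantizes, sums, and re-quantizes, trading a small rounding error (bounded by one quantization step) for a 4x bandwidth saving over float32.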

Maintenance & Community

The project is actively developed by PrimeIntellect-ai. Further community and roadmap details are not explicitly provided in the README.

Licensing & Compatibility

The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The framework relies on specific environment variables for distributed setup and may require careful configuration of network settings (e.g., VPNs) for optimal performance. The setup process involves multiple steps and external scripts, indicating a potentially steep learning curve.
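For context, PyTorch distributed jobs are typically configured with per-node environment variables like the following. The exact variables Prime reads are not documented in this summary, so treat this as an illustrative sketch of the kind of setup involved (the address and counts are placeholders):

```shell
# Standard torch.distributed rendezvous variables (illustrative values).
export MASTER_ADDR=10.0.0.1   # hypothetical rendezvous host
export MASTER_PORT=29500      # rendezvous port
export RANK=0                 # this node's global rank
export WORLD_SIZE=4           # total number of participating nodes
```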

Health Check
Last Commit

3 months ago

Responsiveness

1+ week

Pull Requests (30d)
0
Issues (30d)
0
Star History
22 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering", "Designing Machine Learning Systems"), Jiayi Pan (author of SWE-Gym; MTS at xAI), and 20 more.

alpa by alpa-projects

0.0%
3k
Auto-parallelization framework for large-scale neural network training and serving
Created 4 years ago
Updated 1 year ago
Starred by Andrej Karpathy (founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Lewis Tunstall (Research Engineer at Hugging Face), and 13 more.

torchtitan by pytorch

0.7%
4k
PyTorch platform for generative AI model training research
Created 1 year ago
Updated 19 hours ago