torchft by pytorch

PyTorch library for fault-tolerant distributed training

Created 11 months ago
401 stars

Top 72.2% on SourcePulse

Project Summary

torchft provides per-step fault tolerance for PyTorch training, so a job can keep running when individual workers fail. It targets researchers and engineers running large-scale distributed training who want to avoid job interruptions and make training more resilient.

How It Works

torchft implements fault tolerance by tracking worker health through per-step heartbeats and by providing ProcessGroup implementations that can be re-initialized after a failure. A "lighthouse" server manages membership, so replica groups can join or leave without halting the entire training process. Recovering or newly added replicas obtain a checkpoint from healthy peers, which keeps recovery and scale-up efficient.
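The membership idea can be illustrated with a small in-memory model. This is a minimal sketch, not torchft's implementation or API: the `ToyLighthouse` class, its `heartbeat` and `quorum` methods, and the `heartbeat_timeout_s` parameter are all hypothetical names chosen for illustration. It assumes a replica counts as healthy while its last heartbeat is recent, and that training proceeds only when at least `min_replicas` healthy replicas remain.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyLighthouse:
    """In-memory stand-in for a membership server (hypothetical, not torchft's API)."""

    min_replicas: int
    heartbeat_timeout_s: float
    last_seen: dict[str, float] = field(default_factory=dict)

    def heartbeat(self, replica_id: str, now: float) -> None:
        # Each replica group reports in once per training step.
        self.last_seen[replica_id] = now

    def quorum(self, now: float) -> list[str] | None:
        # A replica is treated as healthy if its last heartbeat is recent enough.
        healthy = sorted(
            rid
            for rid, seen in self.last_seen.items()
            if now - seen <= self.heartbeat_timeout_s
        )
        # Return the member list only when enough healthy replicas remain.
        return healthy if len(healthy) >= self.min_replicas else None


lighthouse = ToyLighthouse(min_replicas=1, heartbeat_timeout_s=5.0)
lighthouse.heartbeat("replica_0", now=0.0)
lighthouse.heartbeat("replica_1", now=0.0)
quorum_step_0 = lighthouse.quorum(now=1.0)

# replica_1 stops reporting; replica_0 keeps heartbeating.
lighthouse.heartbeat("replica_0", now=10.0)
quorum_step_1 = lighthouse.quorum(now=11.0)
```

In this toy model the second quorum shrinks to the surviving replica instead of failing, which mirrors the idea of changing membership without stopping the whole job.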

Quick Start & Requirements

  • Install: pip install . from a source checkout, or pip install -e '.[dev]' for development.
  • Prerequisites: Rust; protobuf-compiler and libprotobuf-dev (Debian/Ubuntu) or protobuf-compiler and protobuf-devel (Red Hat); PyTorch 2.7 RC or later, or a nightly build.
  • Usage: start the lighthouse with RUST_BACKTRACE=1 torchft_lighthouse --min_replicas 1 --quorum_tick_ms 100 --join_timeout_ms 10000.
  • Example DDP training: TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --master_port 29501 --nnodes 1 --nproc_per_node 1 train_ddp.py. The torchrun master port must differ from the lighthouse port (29510 here).

Highlighted Details

  • Implements Fault Tolerant DDP and Fault Tolerant HSDP (e.g., Llama 3 70B with torchtitan).
  • Provides reusable components: coordination primitives, fault-tolerant ProcessGroups, and checkpoint transports.
  • Supports LocalSGD and DiLoCo algorithms (experimental).
  • Offers a fault-tolerant parameter server implementation independent of the lighthouse.
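To make the LocalSGD item concrete, here is a minimal sketch of the general technique in plain Python. It does not use torchft's API; the function `local_sgd` and its parameters are hypothetical. It assumes each worker minimizes a simple quadratic `(w - target)^2` on its own copy of a scalar parameter, takes `sync_every` local steps without communication, and then all workers replace their parameter with the average. DiLoCo differs in that it applies an outer optimizer to the averaged change rather than averaging parameters directly; that variant is not shown.

```python
def local_sgd(targets, sync_every, rounds, lr=0.1, w0=0.0):
    """Toy LocalSGD: each worker minimizes (w - target)^2 on its own copy of w."""
    workers = [w0 for _ in targets]
    for _ in range(rounds):
        # Local phase: independent gradient steps, no communication.
        for _ in range(sync_every):
            workers = [w - lr * 2.0 * (w - t) for w, t in zip(workers, targets)]
        # Sync phase: average parameters across workers and share the mean.
        mean = sum(workers) / len(workers)
        workers = [mean for _ in workers]
    return workers[0]


# Two workers with different local optima (1.0 and 3.0).
w = local_sgd(targets=[1.0, 3.0], sync_every=4, rounds=50)
```

With these quadratic objectives the synchronized parameter approaches 2.0, the minimizer of the summed loss, even though workers communicate only once every four steps.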

Maintenance & Community

  • Actively under development (alpha prototype).
  • Contributions are welcome; reach out for collaboration.

Licensing & Compatibility

  • BSD 3-Clause licensed. Permissive for commercial use and closed-source linking.

Limitations & Caveats

This is an alpha prototype and may contain bugs or undergo breaking changes. LocalSGD and DiLoCo are experimental.

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: 1 week
  • Pull requests (30d): 7
  • Issues (30d): 0
  • Star history: 20 stars in the last 30 days
