torchft by pytorch

PyTorch library for fault-tolerant distributed training

created 9 months ago
371 stars

Top 77.4% on sourcepulse

Project Summary

torchft provides per-step fault tolerance for PyTorch distributed training, so a job keeps running when errors occur instead of halting. It targets researchers and engineers running large-scale distributed training who want to avoid job interruptions and improve training resilience.

How It Works

torchft implements fault tolerance by coordinating worker health through per-step heartbeats and by providing ProcessGroup implementations that can be re-initialized after a failure. A "lighthouse" server manages membership, so replica groups can join or leave without halting the entire training process. Recovery and scale-up stay efficient because healthy peers are used for checkpointing.
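
The heartbeat-driven membership described above can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not torchft's lighthouse or its API: the class ToyLighthouse, its methods, and the heartbeat_timeout_s parameter are hypothetical names chosen for illustration. It only shows the idea that a replica which stops heartbeating drops out of the next quorum while the others continue.

```python
from dataclasses import dataclass, field


@dataclass
class ToyLighthouse:
    """Toy membership tracker (hypothetical; not torchft's lighthouse)."""

    min_replicas: int = 1
    heartbeat_timeout_s: float = 5.0  # assumed timeout, for illustration only
    last_seen: dict = field(default_factory=dict)

    def heartbeat(self, replica_id: str, now: float) -> None:
        # Record the most recent time this replica reported in.
        self.last_seen[replica_id] = now

    def quorum(self, now: float):
        # Keep only replicas whose last heartbeat is recent enough.
        healthy = sorted(
            rid
            for rid, seen in self.last_seen.items()
            if now - seen <= self.heartbeat_timeout_s
        )
        # Report no quorum if too few healthy replicas remain.
        return healthy if len(healthy) >= self.min_replicas else None


lighthouse = ToyLighthouse(min_replicas=1, heartbeat_timeout_s=5.0)
lighthouse.heartbeat("replica_0", now=0.0)
lighthouse.heartbeat("replica_1", now=0.0)
step_1 = lighthouse.quorum(now=1.0)

# replica_1 stops heartbeating; replica_0 keeps reporting in.
lighthouse.heartbeat("replica_0", now=10.0)
step_2 = lighthouse.quorum(now=10.0)
print(step_1, step_2)
```

In this toy model the surviving replica forms the next quorum on its own, which mirrors the idea of changing membership without stopping the whole job.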

Quick Start & Requirements

  • Install: pip install . or pip install -e '.[dev]' for development.
  • Prerequisites: Rust; protobuf-compiler and libprotobuf-dev (Debian/Ubuntu) or protobuf-compiler and protobuf-devel (Red Hat); PyTorch 2.7 RC+ or Nightly.
  • Usage: Start lighthouse with RUST_BACKTRACE=1 torchft_lighthouse --min_replicas 1 --quorum_tick_ms 100 --join_timeout_ms 10000.
  • Example DDP training: TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --master_port 29510 --nnodes 1 --nproc_per_node 1 train_ddp.py.

Highlighted Details

  • Implements Fault Tolerant DDP and Fault Tolerant HSDP (e.g., Llama 3 70B with torchtitan).
  • Provides reusable components: coordination primitives, fault-tolerant ProcessGroups, and checkpoint transports.
  • Supports LocalSGD and DiLoCo algorithms (experimental).
  • Offers a fault-tolerant parameter server implementation independent of the lighthouse.
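
To make the LocalSGD item above concrete, here is a minimal pure-Python sketch of the general technique: each worker takes several local gradient steps on its own copy of the parameters, and the copies are averaged every few steps. It does not use or reflect torchft's implementation; the function local_sgd, its parameters, and the quadratic toy objective are assumptions made for illustration.

```python
def local_sgd(targets, steps, sync_every, lr=0.1):
    """Toy LocalSGD: worker i minimizes (w - targets[i])**2 on its own copy of w."""
    weights = [0.0 for _ in targets]
    for step in range(1, steps + 1):
        # Local update on each worker: the gradient of (w - t)**2 is 2 * (w - t).
        weights = [w - lr * 2.0 * (w - t) for w, t in zip(weights, targets)]
        if step % sync_every == 0:
            # Synchronization point: every worker adopts the average of all copies.
            mean = sum(weights) / len(weights)
            weights = [mean for _ in weights]
    return weights


# Two workers with different local optima, synchronizing every 4 steps.
final = local_sgd(targets=[1.0, 3.0], steps=40, sync_every=4)
print(final)
```

Because the workers communicate only at synchronization points, a larger sync_every trades communication for more drift between the copies in between averages.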

Maintenance & Community

  • Actively under development (alpha prototype).
  • Contributions are welcome; reach out for collaboration.

Licensing & Compatibility

  • BSD 3-Clause licensed. Permissive for commercial use and closed-source linking.

Limitations & Caveats

This is an alpha prototype and may contain bugs or undergo breaking changes. LocalSGD and DiLoCo are experimental.

Health Check

  • Last commit: 1 day ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 26
  • Issues (30d): 1

Star History
82 stars in the last 90 days
