PyTorch library for fault-tolerant distributed training
This repository provides per-step fault tolerance for PyTorch training, enabling continuous execution even when errors occur. It targets researchers and engineers working with large-scale distributed training, offering a robust framework to prevent job interruptions and improve training resilience.
How It Works
torchft implements fault tolerance by coordinating worker health via per-step heartbeating and by providing re-initializable ProcessGroup implementations. It uses a "lighthouse" server for membership management, allowing replica groups to join or leave dynamically without halting the entire training job. This enables efficient recovery and scale-up by leveraging healthy peers for checkpointing.
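The heartbeat-and-quorum idea above can be sketched in a few lines of plain Python. This is an illustrative model, not the torchft API: the class name, method names, and timeout parameter are all hypothetical stand-ins for what the lighthouse server does internally.

```python
# Illustrative sketch (NOT the torchft API): a lighthouse-style server
# tracks per-step heartbeats and admits a quorum of live replicas.
import time


class Lighthouse:
    """Tracks replica heartbeats and reports the current quorum."""

    def __init__(self, min_replicas, heartbeat_timeout_s=1.0):
        self.min_replicas = min_replicas
        self.heartbeat_timeout_s = heartbeat_timeout_s
        self.last_seen = {}  # replica_id -> timestamp of last heartbeat

    def heartbeat(self, replica_id, now=None):
        # Each replica pings once per training step.
        self.last_seen[replica_id] = time.monotonic() if now is None else now

    def quorum(self, now=None):
        """Return the sorted live members, or None if below min_replicas."""
        now = time.monotonic() if now is None else now
        live = sorted(
            r for r, t in self.last_seen.items()
            if now - t <= self.heartbeat_timeout_s
        )
        return live if len(live) >= self.min_replicas else None


lh = Lighthouse(min_replicas=1)
lh.heartbeat("replica_0", now=0.0)
lh.heartbeat("replica_1", now=0.0)
print(lh.quorum(now=0.5))  # → ['replica_0', 'replica_1']
print(lh.quorum(now=2.0))  # → None (both heartbeats timed out)
```

Because membership is recomputed per step, a crashed replica drops out of the next quorum instead of stalling the collective, and a recovered one can rejoin the same way.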
Quick Start & Requirements
Install with pip install . or pip install -e '.[dev]' for development.

Requirements: protobuf-compiler and libprotobuf-dev (Debian/Ubuntu) or protobuf-compiler and protobuf-devel (Red Hat), plus PyTorch 2.7 RC+ or Nightly.

Start the lighthouse server:

RUST_BACKTRACE=1 torchft_lighthouse --min_replicas 1 --quorum_tick_ms 100 --join_timeout_ms 10000

Run the example trainer:

TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --master_port 29510 --nnodes 1 --nproc_per_node 1 train_ddp.py
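Peer-based recovery, mentioned under How It Works, can be illustrated with a self-contained sketch. This is not torchft's checkpoint transport: the Replica class and pickle serialization are stand-ins showing why a recovering worker can rejoin at the current step instead of forcing a full job restart.

```python
# Illustrative sketch (NOT torchft's transport): a recovering replica
# pulls the latest state from a healthy peer. pickle stands in for a
# real tensor-aware checkpoint transport.
import pickle


class Replica:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.step = 0
        self.weights = [0.0, 0.0]

    def train_step(self):
        # Stand-in for forward/backward/optimizer.step().
        self.step += 1
        self.weights = [w + 0.1 for w in self.weights]

    def checkpoint_bytes(self):
        # A healthy peer serves its current state on request.
        return pickle.dumps({"step": self.step, "weights": self.weights})

    def restore_from_peer(self, payload):
        # The recovering replica adopts the peer's step and weights,
        # rejoining the quorum at the current step rather than step 0.
        state = pickle.loads(payload)
        self.step = state["step"]
        self.weights = state["weights"]


healthy, recovering = Replica("r0"), Replica("r1")
for _ in range(5):
    healthy.train_step()
recovering.restore_from_peer(healthy.checkpoint_bytes())
print(recovering.step)  # → 5, caught up without restarting the job
```

The design point this models: because every healthy replica holds a current copy of the state, recovery is a peer transfer at the next step boundary rather than a cold restart from persistent storage.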
Highlighted Details
Re-initializable ProcessGroup implementations and checkpoint transports.
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
This is an alpha prototype and may contain bugs or undergo breaking changes. LocalSGD and DiLoCo are experimental.