meta-pytorch/torchft: PyTorch library for fault-tolerant distributed training
Top 65.1% on SourcePulse
This repository provides per-step fault tolerance for PyTorch training, allowing a job to keep running even when individual workers fail. It targets researchers and engineers running large-scale distributed training, offering a framework that prevents whole-job interruptions and improves training resilience.
How It Works
torchft implements fault tolerance by coordinating worker health via per-step heartbeating and providing re-initializable ProcessGroup implementations. It utilizes a "lighthouse" server for membership management, allowing for dynamic replica group changes without halting the entire training process. This approach enables efficient recovery and scale-up operations by leveraging healthy peers for checkpointing.
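The per-step heartbeat-and-quorum mechanism described above can be illustrated with a toy sketch. This is not torchft's actual API: the `Lighthouse` class, `heartbeat`/`quorum` methods, and timeout values below are hypothetical stand-ins showing how a membership server can shrink the healthy replica set each step instead of killing the job.

```python
import time

class Lighthouse:
    """Toy membership server: tracks per-step heartbeats and computes a quorum.

    Hypothetical stand-in for illustration; not torchft's real interface.
    """
    def __init__(self, min_replicas, timeout=1.0):
        self.min_replicas = min_replicas
        self.timeout = timeout          # seconds before a replica is considered dead
        self.last_seen = {}             # replica id -> last heartbeat timestamp

    def heartbeat(self, replica_id):
        self.last_seen[replica_id] = time.monotonic()

    def quorum(self):
        """Return the currently healthy replicas; fail if below the minimum."""
        now = time.monotonic()
        healthy = sorted(r for r, t in self.last_seen.items()
                         if now - t < self.timeout)
        if len(healthy) < self.min_replicas:
            raise RuntimeError("quorum lost")
        return healthy

lighthouse = Lighthouse(min_replicas=1)

# Step N: both replicas heartbeat, so both participate in this step.
for r in ("replica0", "replica1"):
    lighthouse.heartbeat(r)
print(lighthouse.quorum())  # ['replica0', 'replica1']

# replica1 crashes: its heartbeat goes stale, and on the next step the
# quorum shrinks to the surviving replica instead of halting training.
lighthouse.last_seen["replica1"] -= 10.0  # simulate a missed heartbeat
print(lighthouse.quorum())  # ['replica0']
```

In the real system the quorum result also drives reconfiguration: the surviving workers re-initialize their ProcessGroup for the new membership, and a rejoining replica restores state from a healthy peer's checkpoint.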
Quick Start & Requirements
- Install: `pip install .`, or `pip install -e '.[dev]'` for development.
- Dependencies: `protobuf-compiler` and `libprotobuf-dev` (Debian/Ubuntu) or `protobuf-compiler` and `protobuf-devel` (Red Hat); PyTorch 2.7 RC+ or nightly.
- Start the lighthouse server: `RUST_BACKTRACE=1 torchft_lighthouse --min_replicas 1 --quorum_tick_ms 100 --join_timeout_ms 10000`
- Launch training: `TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --master_port 29510 --nnodes 1 --nproc_per_node 1 train_ddp.py`

Highlighted Details
Pluggable ProcessGroups and checkpoint transports.

Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
This is an alpha prototype and may contain bugs or undergo breaking changes. LocalSGD and DiLoCo are experimental.