PyTorch library for fault-tolerant distributed training
This repository provides per-step fault tolerance for PyTorch training, enabling continuous execution even when errors occur. It targets researchers and engineers working with large-scale distributed training, offering a robust framework to prevent job interruptions and improve training resilience.
How It Works
torchft implements fault tolerance by coordinating worker health via per-step heartbeating and by providing re-initializable ProcessGroup implementations. It uses a "lighthouse" server for membership management, allowing replica groups to join or leave dynamically without halting the entire training job. This enables efficient recovery and scale-up by leveraging healthy peers for checkpointing.
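The heartbeat-and-quorum idea above can be sketched in a few lines of plain Python. This is an illustrative model, not the torchft API: the class name, method names, and timeout parameter are all hypothetical stand-ins for what the lighthouse server does internally.

```python
# Illustrative sketch (NOT the torchft API): a lighthouse-style server
# tracks per-step heartbeats and admits a quorum of live replicas.
import time


class Lighthouse:
    """Tracks replica heartbeats and reports the current quorum."""

    def __init__(self, min_replicas, heartbeat_timeout_s=1.0):
        self.min_replicas = min_replicas
        self.heartbeat_timeout_s = heartbeat_timeout_s
        self.last_seen = {}  # replica_id -> timestamp of last heartbeat

    def heartbeat(self, replica_id, now=None):
        # Each replica pings once per training step.
        self.last_seen[replica_id] = time.monotonic() if now is None else now

    def quorum(self, now=None):
        """Return the sorted live members, or None if below min_replicas."""
        now = time.monotonic() if now is None else now
        live = sorted(
            r for r, t in self.last_seen.items()
            if now - t <= self.heartbeat_timeout_s
        )
        return live if len(live) >= self.min_replicas else None


lh = Lighthouse(min_replicas=1)
lh.heartbeat("replica_0", now=0.0)
lh.heartbeat("replica_1", now=0.0)
print(lh.quorum(now=0.5))  # → ['replica_0', 'replica_1']
print(lh.quorum(now=2.0))  # → None (both heartbeats timed out)
```

Because membership is recomputed per step, a crashed replica drops out of the next quorum instead of stalling the collective, and a recovered one can rejoin the same way.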
Quick Start & Requirements
Install with pip install . or pip install -e '.[dev]' for development.

Requirements: protobuf-compiler and libprotobuf-dev (Debian/Ubuntu) or protobuf-compiler and protobuf-devel (Red Hat), plus PyTorch 2.7 RC+ or Nightly.

Start the lighthouse server:

RUST_BACKTRACE=1 torchft_lighthouse --min_replicas 1 --quorum_tick_ms 100 --join_timeout_ms 10000

Run the example trainer:

TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --master_port 29510 --nnodes 1 --nproc_per_node 1 train_ddp.py
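Peer-based recovery, mentioned under How It Works, can be illustrated with a self-contained sketch. This is not torchft's checkpoint transport: the Replica class and pickle serialization are stand-ins showing why a recovering worker can rejoin at the current step instead of forcing a full job restart.

```python
# Illustrative sketch (NOT torchft's transport): a recovering replica
# pulls the latest state from a healthy peer. pickle stands in for a
# real tensor-aware checkpoint transport.
import pickle


class Replica:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.step = 0
        self.weights = [0.0, 0.0]

    def train_step(self):
        # Stand-in for forward/backward/optimizer.step().
        self.step += 1
        self.weights = [w + 0.1 for w in self.weights]

    def checkpoint_bytes(self):
        # A healthy peer serves its current state on request.
        return pickle.dumps({"step": self.step, "weights": self.weights})

    def restore_from_peer(self, payload):
        # The recovering replica adopts the peer's step and weights,
        # rejoining the quorum at the current step rather than step 0.
        state = pickle.loads(payload)
        self.step = state["step"]
        self.weights = state["weights"]


healthy, recovering = Replica("r0"), Replica("r1")
for _ in range(5):
    healthy.train_step()
recovering.restore_from_peer(healthy.checkpoint_bytes())
print(recovering.step)  # → 5, caught up without restarting the job
```

The design point this models: because every healthy replica holds a current copy of the state, recovery is a peer transfer at the next step boundary rather than a cold restart from persistent storage.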
Highlighted Details
Re-initializable ProcessGroup implementations and checkpoint transports.
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
This is an alpha prototype and may contain bugs or undergo breaking changes. LocalSGD and DiLoCo are experimental.