dlrover  by intelligent-machine-learning

Distributed deep learning system for simplified large AI model training

Created 3 years ago
1,619 stars

Top 25.8% on SourcePulse

Project Summary

DLRover is an automated distributed deep learning system designed to simplify and stabilize the training of large AI models. It targets model developers who want to focus on model architecture rather than distributed systems engineering, offering features like fault tolerance, fast checkpointing, and auto-scaling for PyTorch and TensorFlow workloads on Kubernetes and Ray.

How It Works

DLRover manages distributed training jobs and provides a layer of abstraction over the underlying cluster orchestration. Its core innovation is the "Flash Checkpoint" mechanism, which keeps checkpoints in host memory, saves them asynchronously, and reloads them from memory after a failure, reducing recovery time from minutes to seconds. Fault tolerance comes from automated failure diagnosis combined with process or node restarts, which improves job completion rates. Auto-scaling adjusts cluster resources at runtime based on observed bottlenecks and throughput to improve utilization and performance. Dynamic data sharding keeps data available when a worker fails and lets faster workers process more data.
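The dynamic data sharding idea can be illustrated with a small, self-contained sketch. This is not DLRover's API: `Shard`, `ShardManager`, and all method names are hypothetical, and the sketch assumes a single in-process dispatcher that hands out index ranges on demand. Because workers pull shards instead of receiving a fixed partition, a faster worker naturally takes more of them, and the unfinished shards of a failed worker can be returned to the queue.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Shard:
    start: int  # index of the first sample (inclusive)
    end: int    # index past the last sample (exclusive)


class ShardManager:
    """Hypothetical shard dispatcher for illustration only."""

    def __init__(self, dataset_size, shard_size):
        # Split the dataset into fixed-size index ranges.
        self._todo = deque(
            Shard(s, min(s + shard_size, dataset_size))
            for s in range(0, dataset_size, shard_size)
        )
        self._doing = {}  # worker_id -> shards currently in flight
        self.done = []

    def get_shard(self, worker_id):
        """Hand the next pending shard to whichever worker asks first."""
        if not self._todo:
            return None
        shard = self._todo.popleft()
        self._doing.setdefault(worker_id, []).append(shard)
        return shard

    def report_done(self, worker_id, shard):
        self._doing[worker_id].remove(shard)
        self.done.append(shard)

    def recover_worker(self, worker_id):
        """Requeue the unfinished shards of a failed worker."""
        for shard in self._doing.pop(worker_id, []):
            self._todo.appendleft(shard)


manager = ShardManager(dataset_size=100, shard_size=25)
a = manager.get_shard("worker-0")
b = manager.get_shard("worker-1")
manager.report_done("worker-0", a)
manager.recover_worker("worker-1")   # worker-1 failed before finishing b
c = manager.get_shard("worker-0")    # worker-0 picks up the requeued shard
```

In this sketch no sample range is lost when `worker-1` fails: its shard goes back to the front of the queue and the next worker to ask receives it.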

Quick Start & Requirements

Highlighted Details

  • Raised the job completion rate for TensorFlow parameter server (PS) training to 95%, up from 89% with tf-operator.
  • Improved GLM-65B training goodput from 69% to 95% through fault tolerance.
  • Flash Checkpoint recovers large model training (e.g., GPT2-1.5B) in seconds by loading from shared memory.
  • Offers extension libraries ATorch (PyTorch) and TFPlus (TensorFlow) for accelerated training.
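The Flash Checkpoint behaviour described above (a fast copy into memory, slower persistence in the background, and recovery from memory when possible) can be sketched as follows. This is a simplified illustration under stated assumptions, not DLRover's implementation or API: `InMemoryCheckpointer` is a hypothetical class, the training state is a plain dictionary, and the snapshot lives in the same process, whereas the source describes loading from shared memory.

```python
import copy
import os
import pickle
import tempfile
import threading


class InMemoryCheckpointer:
    """Hypothetical sketch: memory snapshot plus asynchronous persistence."""

    def __init__(self, path):
        self._path = path
        self._lock = threading.Lock()
        self._snapshot = None  # (step, state) kept in host memory
        self._writer = None    # background thread writing to disk

    def save(self, step, state):
        # Fast, blocking part: copy the state into host memory.
        snapshot = (step, copy.deepcopy(state))
        with self._lock:
            self._snapshot = snapshot
        # Slow part: persist in the background so training can continue.
        self.wait()  # allow only one writer at a time
        self._writer = threading.Thread(target=self._persist, args=(snapshot,))
        self._writer.start()

    def _persist(self, snapshot):
        tmp = self._path + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(snapshot, f)
        os.replace(tmp, self._path)  # atomic rename avoids partial files

    def wait(self):
        if self._writer is not None:
            self._writer.join()

    def load(self):
        # Prefer the in-memory copy; fall back to storage if it is gone.
        with self._lock:
            if self._snapshot is not None:
                return copy.deepcopy(self._snapshot)
        if os.path.exists(self._path):
            with open(self._path, "rb") as f:
                return pickle.load(f)
        return None


ckpt_path = os.path.join(tempfile.mkdtemp(), "model.ckpt")
ckpt = InMemoryCheckpointer(ckpt_path)
state = {"weights": [0.1, 0.2], "lr": 0.01}
ckpt.save(step=10, state=state)
state["weights"][0] = 9.9                 # training keeps mutating the state
step, restored = ckpt.load()              # recovery path: read from memory
ckpt.wait()
from_disk = InMemoryCheckpointer(ckpt_path).load()  # memory lost: read from disk
```

The design choice illustrated here is that only the memory copy blocks the training loop; the write to storage happens off the critical path, and the storage copy is needed only when the in-memory copy is unavailable.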

Maintenance & Community

  • Active development with recent publications and accepted papers (ICLR'25, VLDB'24, KDD'23).
  • Community channels available via DingTalk and WeChat.

Licensing & Compatibility

  • Apache 2.0 License. Permissive for commercial use and integration with closed-source projects.

Limitations & Caveats

  • Primarily targets Kubernetes and Ray environments.
  • Some advanced features, such as multi-node in-memory redundant backup checkpointing, are not yet available and are listed under "What's Next".

Health Check

  • Last commit: 3 days ago
  • Responsiveness: 1 day
  • Pull requests (30d): 10
  • Issues (30d): 6
  • Star history: 18 stars in the last 30 days
