dlrover by intelligent-machine-learning

Distributed deep learning system for simplified large AI model training

Created 3 years ago
1,552 stars

Top 26.9% on SourcePulse

View on GitHub
Project Summary

DLRover is an automated distributed deep learning system designed to simplify and stabilize the training of large AI models. It targets model developers who want to focus on model architecture rather than distributed systems engineering, offering features like fault tolerance, fast checkpointing, and auto-scaling for PyTorch and TensorFlow workloads on Kubernetes and Ray.

How It Works

DLRover manages distributed training jobs, providing a layer of abstraction over the underlying cluster orchestration. Its core innovation is the "Flash Checkpoint" mechanism, which saves and loads checkpoints from host memory asynchronously or upon failure, cutting recovery time from minutes to seconds. Fault tolerance is achieved through intelligent failure diagnosis and process or node restarts, improving job completion rates. Auto-scaling dynamically adjusts cluster resources based on observed bottlenecks and throughput, improving utilization and performance. Dynamic data sharding keeps data available when workers fail and lets faster workers process more data.
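
In the PyTorch path, Flash Checkpoint is exposed through a checkpointer object that the training loop calls periodically. The sketch below is a minimal, hedged example for a DDP job: the DdpCheckpointer and StorageType names are assumptions based on DLRover's documented flash_checkpoint package, and the exact module paths and signatures should be verified against the installed release.

```python
# Minimal sketch of Flash Checkpoint in a PyTorch DDP training loop.
# Assumption: the DdpCheckpointer / StorageType interface below mirrors
# DLRover's flash_checkpoint package; verify names against your installed version.
import torch
import torch.nn as nn

from dlrover.trainer.torch.flash_checkpoint.ddp import DdpCheckpointer, StorageType

model = nn.Linear(1024, 1024)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
checkpointer = DdpCheckpointer("/tmp/flash_ckpt")  # hypothetical checkpoint dir

for step in range(1, 1001):
    x = torch.randn(8, 1024)
    loss = model(x).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    state = {
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "step": step,
    }
    if step % 50 == 0:
        # Asynchronous save to host shared memory: training blocks only for the
        # device-to-host copy, which is why recovery takes seconds, not minutes.
        checkpointer.save_checkpoint(step, state, storage_type=StorageType.MEMORY)
    if step % 500 == 0:
        # Periodically persist the in-memory checkpoint to durable storage.
        checkpointer.save_checkpoint(step, state, storage_type=StorageType.DISK)
```

On recovery, the job can reload the most recent in-memory copy before falling back to persistent storage, which is what keeps restart times in the seconds range.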

Quick Start & Requirements

Highlighted Details

  • Achieved a 95% job completion rate for TensorFlow PS training, up from 89% with tf-operator.
  • Improved GLM-65B training goodput from 69% to 95% through fault tolerance (see the launch sketch after this list).
  • Flash Checkpoint recovers large-model training (e.g., GPT2-1.5B) in seconds by loading from shared memory.
  • Offers the extension libraries ATorch (PyTorch) and TFPlus (TensorFlow) for accelerated training.
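
The fault-tolerance path referenced above relies on DLRover's torchrun-compatible launcher rather than on changes to the training script itself. Below is a hedged sketch: the script is plain PyTorch DDP, and the dlrover-run flags shown in the comment are assumptions modeled on torchrun's elastic arguments, so check `dlrover-run --help` for the options your version actually supports.

```python
# train.py -- ordinary PyTorch DDP script; fault tolerance comes from the launcher.
# Assumed launch command (flags modeled on torchrun; verify with `dlrover-run --help`):
#   dlrover-run --nnodes=1:4 --nproc_per_node=8 --max-restarts=3 train.py
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # The launcher sets RANK / LOCAL_RANK / WORLD_SIZE and restarts failed
    # processes (or replaces failed nodes) without tearing down the whole job.
    dist.init_process_group(backend="gloo")
    torch.manual_seed(42 + int(os.environ.get("LOCAL_RANK", "0")))

    model = DDP(nn.Linear(128, 128))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(100):
        x = torch.randn(32, 128)
        loss = model(x).square().mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Because the launcher owns rank assignment and restarts, the same script runs unchanged under torchrun for local debugging.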

Maintenance & Community

  • Active development with recent publications and accepted papers (ICLR'25, VLDB'24, KDD'23).
  • Community channels available via DingTalk and WeChat.

Licensing & Compatibility

  • Apache 2.0 License. Permissive for commercial use and integration with closed-source projects.

Limitations & Caveats

  • Primarily targets Kubernetes and Ray environments.
  • Some advanced features, such as multi-node in-memory redundant backup checkpointing, are still listed under "What's Next".

Health Check

  • Last Commit: 14 hours ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 27
  • Issues (30d): 9

Star History

31 stars in the last 30 days

Explore Similar Projects

Starred by Yaowei Zheng (Author of LLaMA-Factory), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 1 more.

VeOmni by ByteDance-Seed

3.4% · 1k stars
Framework for scaling multimodal model training across accelerators
Created 5 months ago · Updated 3 weeks ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Lewis Tunstall (Research Engineer at Hugging Face), and 13 more.

torchtitan by pytorch

0.7% · 4k stars
PyTorch platform for generative AI model training research
Created 1 year ago · Updated 21 hours ago
Starred by Tobi Lutke (Cofounder of Shopify), Li Jiang (Coauthor of AutoGen; Engineer at Microsoft), and 26 more.

ColossalAI by hpcaitech

0.1% · 41k stars
AI system for large-scale parallel training
Created 3 years ago · Updated 14 hours ago