dlrover  by intelligent-machine-learning

Distributed deep learning system for simplified large AI model training

created 3 years ago
1,514 stars

Top 27.8% on sourcepulse

Project Summary

DLRover is an automated distributed deep learning system designed to simplify and stabilize the training of large AI models. It targets model developers who want to focus on model architecture rather than distributed systems engineering, offering features like fault tolerance, fast checkpointing, and auto-scaling for PyTorch and TensorFlow workloads on Kubernetes and Ray.

How It Works

DLRover manages distributed training jobs and provides a layer of abstraction over the underlying cluster orchestration. Its core innovation is the "Flash Checkpoint" mechanism, which saves checkpoints to host memory and loads them back from it, either asynchronously or when a failure occurs, reducing recovery time from minutes to seconds. Fault tolerance comes from automated failure diagnosis followed by process or node restarts, which improves job completion rates. Auto-scaling adjusts cluster resources dynamically based on observed bottlenecks and throughput to improve utilization and performance. Dynamic data sharding keeps data available when workers fail and lets faster workers process more data.
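The dynamic data sharding idea can be illustrated with a minimal sketch. This is not DLRover's API: the `ShardQueue` class, its method names, and the worker ids are hypothetical, and the sketch assumes a single in-process queue rather than a service shared across a cluster. It only shows the two properties described above: a failed worker's shard goes back to the queue, and a faster worker ends up with more shards simply because it asks for them more often.

```python
from collections import deque


class ShardQueue:
    """Toy stand-in for a dynamic data-sharding service (hypothetical, not DLRover's API)."""

    def __init__(self, dataset_size, shard_size):
        # Split the sample index range [0, dataset_size) into fixed-size shards.
        self.todo = deque(
            (start, min(start + shard_size, dataset_size))
            for start in range(0, dataset_size, shard_size)
        )
        self.doing = {}  # worker id -> shard currently assigned to that worker
        self.done = []   # shards whose completion was reported

    def next_shard(self, worker):
        """Hand the next pending shard to a worker, or None when nothing is pending."""
        if not self.todo:
            return None
        shard = self.todo.popleft()
        self.doing[worker] = shard
        return shard

    def report_done(self, worker):
        """Mark the worker's current shard as finished."""
        self.done.append(self.doing.pop(worker))

    def report_failure(self, worker):
        """Put a failed worker's unfinished shard back so another worker can take it."""
        shard = self.doing.pop(worker, None)
        if shard is not None:
            self.todo.appendleft(shard)


queue = ShardQueue(dataset_size=100, shard_size=25)
queue.next_shard("worker-a")       # worker-a takes the first shard
queue.next_shard("worker-b")       # worker-b takes the second shard
queue.report_failure("worker-b")   # worker-b dies; its shard is re-queued
queue.report_done("worker-a")
# worker-a is still alive, so it keeps pulling shards until none remain.
while queue.next_shard("worker-a") is not None:
    queue.report_done("worker-a")
```

In this run every shard is eventually completed even though one worker failed mid-shard. A real system would also need timeouts to detect silent failures and a way to persist the queue state, both of which this sketch leaves out.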

Quick Start & Requirements

Highlighted Details

  • Achieved a 95% job completion rate for TensorFlow PS (parameter server) training, up from 89% with tf-operator.
  • Improved GLM-65B training goodput from 69% to 95% through fault tolerance.
  • Flash Checkpoint recovers large model training (e.g., GPT2-1.5B) in seconds by loading from shared memory.
  • Offers extension libraries ATorch (PyTorch) and TFPlus (TensorFlow) for accelerated training.
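The memory-first checkpointing idea behind the Flash Checkpoint item above can be sketched in plain Python. This is an illustration under stated assumptions, not DLRover's implementation or API: the `InMemoryCheckpointer` class is hypothetical, the "model state" is an ordinary dictionary, and the snapshot lives inside the same process, whereas the summary above describes loading from shared memory. The sketch shows only the general pattern: the training loop blocks for an in-memory copy, persistence to storage runs in a background thread, and recovery prefers the in-memory copy.

```python
import copy
import os
import pickle
import tempfile
import threading


class InMemoryCheckpointer:
    """Hypothetical illustration of memory-first checkpointing; not DLRover's API."""

    def __init__(self, path):
        self.path = path
        self._snapshot = None  # latest (step, state) copy held in host memory
        self._writer = None    # background thread persisting the snapshot

    def save(self, step, state):
        # Blocking part: only an in-memory copy, so training can continue quickly.
        self._snapshot = (step, copy.deepcopy(state))
        # Non-blocking part: persist the snapshot to storage in the background.
        self.wait()  # let any previous write finish first
        self._writer = threading.Thread(target=self._persist, args=(self._snapshot,))
        self._writer.start()

    def _persist(self, snapshot):
        tmp = self.path + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(snapshot, f)
        os.replace(tmp, self.path)  # atomic rename avoids a half-written file

    def wait(self):
        """Block until the pending background write, if any, has finished."""
        if self._writer is not None:
            self._writer.join()

    def load(self):
        # Prefer the in-memory copy; fall back to storage if memory was lost.
        if self._snapshot is not None:
            return self._snapshot
        with open(self.path, "rb") as f:
            return pickle.load(f)


ckpt_path = os.path.join(tempfile.mkdtemp(), "ckpt.pkl")
ckpt = InMemoryCheckpointer(ckpt_path)
state = {"weights": [0.1, 0.2], "lr": 0.01}
ckpt.save(step=100, state=state)
state["weights"][0] = 999.0       # training continues and mutates the live state
step, restored = ckpt.load()      # fast path: restore from the in-memory snapshot
ckpt.wait()
# Slow path: a fresh checkpointer has no snapshot in memory and reads from disk.
disk_step, disk_state = InMemoryCheckpointer(ckpt_path).load()
```

The deep copy is what makes the background write safe here: later updates to the live state cannot change what gets persisted. With real GPU tensors the blocking step would be a device-to-host copy instead, which is outside the scope of this sketch.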

Maintenance & Community

  • Active development with recent publications and accepted papers (ICLR'25, VLDB'24, KDD'23).
  • Community channels available via DingTalk and WeChat.

Licensing & Compatibility

  • Apache 2.0 License. Permissive for commercial use and integration with closed-source projects.

Limitations & Caveats

  • Primarily targets Kubernetes and Ray environments.
  • Some advanced features like multi-node in-memory redundant backup checkpointing are listed as "What's Next".

Health Check

  • Last commit: 2 days ago
  • Responsiveness: 1 week
  • Pull requests (30d): 19
  • Issues (30d): 3

Star History

  • 92 stars in the last 90 days
