dlrover by intelligent-machine-learning

Distributed deep learning system for simplified large AI model training

Created 3 years ago

1,619 stars

Top 25.8% on SourcePulse

Project Summary

DLRover is an automated distributed deep learning system designed to simplify and stabilize the training of large AI models. It targets model developers who want to focus on model architecture rather than distributed systems engineering, offering features like fault tolerance, fast checkpointing, and auto-scaling for PyTorch and TensorFlow workloads on Kubernetes and Ray.

How It Works

DLRover manages distributed training jobs and provides a layer of abstraction over the underlying cluster orchestrator. Its core innovation is the "Flash Checkpoint" mechanism, which saves checkpoints to and loads them from host memory, either asynchronously or upon failure, reducing recovery time from minutes to seconds. Fault tolerance is achieved through automated failure diagnosis followed by process or node restarts, which improves job completion rates. Auto-scaling adjusts cluster resources dynamically based on observed bottlenecks and throughput to improve utilization and performance. Dynamic data sharding keeps data available when workers fail and lets faster workers process more data.
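The dynamic data sharding idea can be illustrated with a small, self-contained sketch. This is not DLRover's API: the `ShardDispatcher` class and its method names are hypothetical, and the sketch only assumes that a central component hands out index ranges on request and re-queues the range held by a worker that fails. Because workers pull shards when they are ready, a faster worker naturally ends up processing more of them.

```python
import collections


class ShardDispatcher:
    """Hypothetical dispatcher that hands out dataset index ranges on request."""

    def __init__(self, dataset_size, shard_size):
        # Queue of (start, end) index ranges that nobody has processed yet.
        self.todo = collections.deque(
            (start, min(start + shard_size, dataset_size))
            for start in range(0, dataset_size, shard_size)
        )
        self.doing = {}  # worker_id -> shard currently assigned to that worker

    def get_shard(self, worker_id):
        """Return the next unprocessed shard, or None when the queue is empty."""
        if not self.todo:
            return None
        shard = self.todo.popleft()
        self.doing[worker_id] = shard
        return shard

    def report_done(self, worker_id):
        """Mark the worker's in-flight shard as finished."""
        self.doing.pop(worker_id, None)

    def report_failure(self, worker_id):
        """Re-queue the in-flight shard of a failed worker so its data is not lost."""
        shard = self.doing.pop(worker_id, None)
        if shard is not None:
            self.todo.appendleft(shard)

    def finished(self):
        return not self.todo and not self.doing


dispatcher = ShardDispatcher(dataset_size=100, shard_size=25)
first = dispatcher.get_shard("worker-0")
second = dispatcher.get_shard("worker-1")
dispatcher.report_failure("worker-1")      # worker-1 dies; its shard is re-queued
dispatcher.report_done("worker-0")
retry = dispatcher.get_shard("worker-0")   # the surviving worker picks it up
```

In this toy run, the shard that `worker-1` held when it failed is the next one handed to `worker-0`, so every index range is still processed exactly once by a live worker.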

Quick Start & Requirements

Install: pip install "dlrover[torch]"
Prerequisites: Python and a Kubernetes or Ray cluster.
Tutorials: Elastic scheduling, Flash Checkpoint, PyTorch on Kubernetes

Highlighted Details

Achieved a 95% job completion rate for TensorFlow PS (parameter server) training, up from 89% with tf-operator.
Improved GLM-65B training goodput from 69% to 95% through fault tolerance.
Flash Checkpoint recovers large model training (e.g., GPT2-1.5B) in seconds by loading from shared memory.
Offers extension libraries ATorch (PyTorch) and TFPlus (TensorFlow) for accelerated training.
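The Flash Checkpoint item above rests on a general pattern: copy the training state into host memory quickly, write it to storage in the background, and restore from the in-memory copy when possible. The following is a minimal sketch of that pattern, not DLRover's implementation. The `InMemoryCheckpointer` class is hypothetical, it uses `pickle` and a background thread in place of real tensor and storage handling, and it keeps the snapshot inside the same process, whereas a real system would need the memory to outlive a crashed training process (for example, shared memory owned by a separate process).

```python
import copy
import os
import pickle
import tempfile
import threading


class InMemoryCheckpointer:
    """Hypothetical checkpointer: snapshot in memory, persist to disk in the background."""

    def __init__(self, directory):
        self.directory = directory
        self._lock = threading.Lock()
        self._latest = None   # (step, state) kept in host memory
        self._writer = None   # background thread performing the slow disk write

    def save(self, step, state):
        """Copy the state into memory (fast), then start an asynchronous disk write."""
        snapshot = copy.deepcopy(state)
        with self._lock:
            self._latest = (step, snapshot)
        self.wait()  # allow at most one outstanding disk write
        self._writer = threading.Thread(target=self._persist, args=(step, snapshot))
        self._writer.start()

    def _persist(self, step, snapshot):
        path = os.path.join(self.directory, f"step_{step}.pkl")
        tmp_path = path + ".tmp"
        with open(tmp_path, "wb") as f:
            pickle.dump(snapshot, f)
        os.replace(tmp_path, path)  # atomic rename: readers never see a partial file

    def wait(self):
        """Block until the pending disk write, if any, has completed."""
        if self._writer is not None:
            self._writer.join()

    def load_latest(self):
        """Fast path: return a copy of the in-memory snapshot without touching disk."""
        with self._lock:
            if self._latest is None:
                return None
            step, snapshot = self._latest
            return step, copy.deepcopy(snapshot)


with tempfile.TemporaryDirectory() as ckpt_dir:
    ckpt = InMemoryCheckpointer(ckpt_dir)
    state = {"weights": [0.1, 0.2], "step": 10}
    ckpt.save(10, state)
    state["weights"][0] = 999.0   # simulate the live state going bad after the save
    restored_step, restored = ckpt.load_latest()
    ckpt.wait()                   # make sure the disk copy has landed
    on_disk = os.path.exists(os.path.join(ckpt_dir, "step_10.pkl"))
```

The training loop only pays for the in-memory copy inside `save`; the disk write overlaps with further training, and `load_latest` returns the snapshot without reading from storage.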

Maintenance & Community

Active development with recent publications and accepted papers (ICLR'25, VLDB'24, KDD'23).
Community channels available via DingTalk and WeChat.

Licensing & Compatibility

Apache 2.0 License. Permissive for commercial use and integration with closed-source projects.

Limitations & Caveats

Primarily targets Kubernetes and Ray environments.
Some advanced features, such as multi-node in-memory redundant backup checkpointing, are still planned and listed under "What's Next".

Health Check

Last Commit

3 days ago

Responsiveness

1 day

Star History

18 stars in the last 30 days