nano-deepspeed by zv1131860787

A ZeRO teaching implementation for understanding distributed training

Created 1 month ago
272 stars

Top 94.7% on SourcePulse

Project Summary

nano-deepspeed is a teaching-focused re-implementation of DeepSpeed ZeRO, designed to make data flow and communication mechanisms easy to follow. It targets engineers and researchers who want to learn ZeRO principles rather than deploy to production. The project prioritizes readable code and explainable behavior, and supports small-scale comparisons against official DeepSpeed.

How It Works

This project provides a simplified, educational re-implementation of DeepSpeed's ZeRO optimizer. It covers core ZeRO stages 0, 1, and 2 (stage 1 partitions optimizer states across ranks; stage 2 additionally partitions gradients), using AdamW and FP16 dynamic loss scaling. The stage 2 implementation highlights communication patterns through packed all-reduce and local scatter-back operations, prioritizing code clarity over production-level optimizations.
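The packed all-reduce with local scatter-back can be sketched as follows. This is a minimal single-process simulation in NumPy (not the project's actual code, which uses torch.distributed collectives); `packed_allreduce_scatter` and its shapes are illustrative assumptions:

```python
# Sketch of ZeRO-2-style gradient handling, simulated across "ranks"
# inside one process. Hypothetical helper; real implementations use
# torch.distributed.all_reduce on a flat buffer.
import numpy as np

def packed_allreduce_scatter(per_rank_grads, world_size):
    """Average gradients across ranks by packing each rank's tensors
    into one flat buffer (one all-reduce instead of one per tensor),
    then let each rank keep only its owned shard (local scatter-back)."""
    # Pack: flatten and concatenate each rank's gradient tensors.
    flats = [np.concatenate([g.ravel() for g in grads])
             for grads in per_rank_grads]
    # All-reduce (average) over the single packed buffer.
    reduced = np.mean(flats, axis=0)
    # Scatter-back: rank r keeps only its contiguous shard, which is
    # all it needs to update its partition of the optimizer states.
    shard = len(reduced) // world_size
    return [reduced[r * shard:(r + 1) * shard] for r in range(world_size)]
```

Packing amortizes collective-launch overhead over many small tensors, which is the communication pattern the stage 2 code is meant to teach.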

Quick Start & Requirements

Requires Python 3.9+, PyTorch (a CUDA build is recommended), and the transformers library for the examples. Official DeepSpeed is needed only for comparative runs. Install with pip install torch transformers, plus pip install deepspeed if you want comparisons. Quick-start commands cover single- and multi-GPU setups, demonstrating usage with torchrun and example configuration files.
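A sketch of the kind of commands the quick start describes; the entry-point script and config file names below are assumptions, not taken from the repository:

```shell
# Install dependencies (DeepSpeed only if you want comparison runs).
pip install torch transformers
pip install deepspeed                 # optional, for comparisons only

# Single GPU (hypothetical entry point and config name):
python train.py --config configs/zero2.json

# Multi-GPU via torchrun, e.g. 2 processes on one node:
torchrun --nproc_per_node=2 train.py --config configs/zero2.json
```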

Highlighted Details

  • Supports ZeRO stages 0, 1, and 2 with FP16 dynamic loss scaling.
  • Stage 2 communication employs packed all-reduce and local scatter-back.
  • Experimental results indicate nano-deepspeed uses more memory than official DeepSpeed under tested configurations (e.g., 2-GPU, 8-GPU), while achieving comparable final losses, indicating that training remains numerically stable despite the simpler implementation.
  • Official DeepSpeed achieves lower memory usage through advanced communication, memory reuse, and scheduling optimizations.
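FP16 dynamic loss scaling, listed above, can be sketched in a few lines. This is a generic illustration (class and parameter names are assumptions, not nano-deepspeed's actual API): the scale grows after a run of overflow-free steps and backs off when gradients overflow:

```python
# Minimal sketch of FP16 dynamic loss scaling. Illustrative names;
# real trainers also skip the optimizer step on overflow.
class DynamicLossScaler:
    def __init__(self, init_scale=2.0**16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale                 # loss is multiplied by this
        self.growth_factor = growth_factor      # scale-up multiplier
        self.backoff_factor = backoff_factor    # scale-down multiplier
        self.growth_interval = growth_interval  # good steps before growing
        self._good_steps = 0

    def update(self, found_overflow):
        if found_overflow:
            # Overflow: shrink the scale and restart the growth counter.
            self.scale *= self.backoff_factor
            self._good_steps = 0
        else:
            self._good_steps += 1
            if self._good_steps == self.growth_interval:
                # A long overflow-free run: it is safe to scale up again.
                self.scale *= self.growth_factor
                self._good_steps = 0
```

Scaling the loss before backward keeps small FP16 gradients from underflowing to zero; the dynamic adjustment finds the largest scale that does not overflow.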

Maintenance & Community

The roadmap outlines future improvements including enhanced tooling for the ZeRO-2 path, plus minimal teaching implementations of ZeRO-3 and offload. No community channels (e.g., Discord, Slack) or notable contributors are listed.

Licensing & Compatibility

The license type is not explicitly stated in the provided README. Compatibility for commercial use or linking with closed-source projects is not detailed.

Limitations & Caveats

This project is strictly for learning and research, not production workloads. It lacks full feature parity with official DeepSpeed, notably omitting ZeRO-3, optimizer/parameter offload, and advanced ecosystem features like MoE or pipeline/tensor parallelism. Engineering robustness for fault tolerance and extreme-scale stability is also simpler compared to the official implementation.

Health Check
Last Commit

1 month ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
0
Star History
218 stars in the last 30 days

Explore Similar Projects

Starred by Théophile Gervet (Cofounder of Genesis AI), Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), and 7 more.

lingua by facebookresearch

0.1%
5k
LLM research codebase for training and inference
Created 1 year ago
Updated 8 months ago
Starred by Clement Delangue (Cofounder of Hugging Face), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 20 more.

accelerate by huggingface

0.1%
10k
PyTorch training helper for distributed execution
Created 5 years ago
Updated 1 day ago