distribuuuu by BIGBALLON

PyTorch distributed training framework

Created 4 years ago · 274 stars · Top 94.3% on SourcePulse

Project Summary

Distribuuuu provides a streamlined framework for distributed PyTorch training, targeting researchers and engineers who need to scale model training across multiple GPUs and nodes. It simplifies the setup and execution of distributed training jobs, offering clear examples for various configurations and integrating with cluster schedulers like Slurm.

How It Works

The framework leverages native PyTorch distributed functionalities, including torch.distributed.launch and torch.multiprocessing, to manage multi-GPU and multi-node training. It utilizes yacs for flexible configuration management, allowing users to define hyperparameters and training settings via YAML files that can be easily overridden. This approach aims for clarity and ease of use in complex distributed setups.
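
As a minimal sketch of that pattern (illustrative only: the script layout, config keys, and file names below are assumptions, not distribuuuu's actual code), a launch-compatible training script combines a yacs config with the standard DDP setup:

    import argparse

    import torch
    import torch.distributed as dist
    from yacs.config import CfgNode as CN

    # Defaults live in code; a YAML file and key/value pairs override them.
    cfg = CN()
    cfg.TRAIN = CN()
    cfg.TRAIN.LR = 0.1
    cfg.TRAIN.BATCH_SIZE = 256

    def main():
        # torch.distributed.launch injects --local_rank into each process it spawns.
        parser = argparse.ArgumentParser()
        parser.add_argument("--local_rank", type=int, default=0)
        parser.add_argument("--cfg", type=str, default="config.yaml")
        args = parser.parse_args()

        cfg.merge_from_file(args.cfg)           # YAML overrides code defaults
        cfg.merge_from_list(["TRAIN.LR", 0.2])  # command-line overrides win last
        cfg.freeze()

        # launch also sets MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE, so
        # init_process_group can read everything from the environment.
        torch.cuda.set_device(args.local_rank)
        dist.init_process_group(backend="nccl", init_method="env://")

        model = torch.nn.Linear(128, 10).cuda(args.local_rank)
        model = torch.nn.parallel.DistributedDataParallel(
            model, device_ids=[args.local_rank]
        )
        # ... build DistributedSampler-backed loaders and train as usual ...

    if __name__ == "__main__":
        main()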

Quick Start & Requirements

  • Install PyTorch (>= 1.6) and other dependencies via pip install -r requirements.txt.
  • Requires the ImageNet dataset, prepared with a specific directory structure and symlinks.
  • Basic usage involves python -m torch.distributed.launch with appropriate arguments for nodes, GPUs, and configuration files; an example command follows this list.
  • Official tutorials are available for detailed guidance.
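
For instance, a hypothetical single-node, 8-GPU run (the script and config file names are illustrative, not necessarily the project's) might be launched as:

    python -m torch.distributed.launch \
        --nnodes=1 --node_rank=0 --nproc_per_node=8 \
        train_net.py --cfg config.yaml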

Highlighted Details

  • Supports single-node multi-GPU (DataParallel, DistributedDataParallel) and multi-node multi-GPU training.
  • Integrates with Slurm for cluster job submission (a sketch follows this list).
  • Provides baseline performance metrics for ResNet18, EfficientNet, and RegNet models trained with specific configurations.
  • Addresses the "zombie processes" issue common in older PyTorch versions.
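
As a rough sketch of Slurm submission (the partition name, resource flags, and script are assumptions, not the project's documented command), a 2-node, 16-GPU job might look like:

    srun --partition=gpu --nodes=2 --ntasks-per-node=8 --gres=gpu:8 \
        python -u train_net.py --cfg config.yaml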

Maintenance & Community

The project is maintained by Wei Li, with contributions welcomed via pull requests. Issues are accepted for suggestions or bug reports.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project relies on PyTorch versions >= 1.6, with specific fixes for zombie processes noted for PyTorch >= 1.8. The lack of an explicit license may pose restrictions for certain use cases.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 2 stars in the last 30 days

Explore Similar Projects

Starred by Théophile Gervet (Cofounder of Genesis AI), Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), and 6 more.

lingua by facebookresearch

LLM research codebase for training and inference
Top 0.1% on SourcePulse · 5k stars · Created 11 months ago · Updated 2 months ago
Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), Lewis Tunstall (Research Engineer at Hugging Face), and 15 more.

torchtune by pytorch

PyTorch library for LLM post-training and experimentation
Top 0.2% on SourcePulse · 5k stars · Created 1 year ago · Updated 1 day ago
Starred by Clement Delangue (Cofounder of Hugging Face), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 20 more.

accelerate by huggingface

PyTorch training helper for distributed execution
Top 0.3% on SourcePulse · 9k stars · Created 4 years ago · Updated 1 day ago
Starred by Peter Norvig (Author of "Artificial Intelligence: A Modern Approach"; Research Director at Google), Alexey Milovidov (Cofounder of ClickHouse), and 29 more.

llm.c by karpathy

LLM training in pure C/CUDA, no PyTorch needed
Top 0.2% on SourcePulse · 28k stars · Created 1 year ago · Updated 2 months ago