distribuuuu by BIGBALLON

PyTorch distributed training framework

created 4 years ago
275 stars

Top 94.9% on sourcepulse

Project Summary

Distribuuuu provides a streamlined framework for distributed PyTorch training, targeting researchers and engineers who need to scale model training across multiple GPUs and nodes. It simplifies the setup and execution of distributed training jobs, offering clear examples for various configurations and integrating with cluster schedulers like Slurm.

How It Works

The framework leverages native PyTorch distributed functionalities, including torch.distributed.launch and torch.multiprocessing, to manage multi-GPU and multi-node training. It utilizes yacs for flexible configuration management, allowing users to define hyperparameters and training settings via YAML files that can be easily overridden. This approach aims for clarity and ease of use in complex distributed setups.
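The YAML-plus-override workflow described above typically looks like the following fragment. The key names here are illustrative of the yacs style, not necessarily distribuuuu's actual schema:

```yaml
# Hypothetical training config in the yacs style described above;
# key names are illustrative, not distribuuuu's actual schema.
MODEL:
  TYPE: resnet18
  NUM_CLASSES: 1000
TRAIN:
  DATASET: imagenet
  BATCH_SIZE: 256
  BASE_LR: 0.1
  MAX_EPOCH: 90
OPTIM:
  MOMENTUM: 0.9
  WEIGHT_DECAY: 0.0001
```

With yacs, individual keys can then be overridden at launch time (e.g. appending `TRAIN.BASE_LR 0.2` to the command line) without editing the file.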

Quick Start & Requirements

  • Install PyTorch (>= 1.6) and other dependencies via pip install -r requirements.txt.
  • Requires ImageNet dataset prepared with a specific directory structure and symlinks.
  • Basic usage involves python -m torch.distributed.launch with appropriate arguments for nodes, GPUs, and configuration files.
  • Official tutorials are available for detailed guidance.
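The launch step above can be sketched as follows. This builds the standard `torch.distributed.launch` argument vector for a two-node, eight-GPU-per-node run; the script name `train_net.py` and config path are hypothetical stand-ins for the repo's actual entry point:

```python
# Sketch: assembling a torch.distributed.launch invocation.
# Script and config names are hypothetical; substitute the repo's own.
import sys

def launch_cmd(nnodes, node_rank, nproc_per_node, master_addr,
               master_port, script, cfg):
    """Build the argv for a multi-node torch.distributed.launch run."""
    return [
        sys.executable, "-m", "torch.distributed.launch",
        f"--nnodes={nnodes}",
        f"--node_rank={node_rank}",          # rank of this machine, 0-based
        f"--nproc_per_node={nproc_per_node}",  # usually one process per GPU
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
        script, "--cfg", cfg,
    ]

# Node 0 of a 2-node x 8-GPU job:
cmd = launch_cmd(2, 0, 8, "10.0.0.1", 29500,
                 "train_net.py", "config/resnet18.yaml")
print(" ".join(cmd))
```

The same command is run on every node with only `--node_rank` changed; the launcher then spawns one worker process per GPU.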

Highlighted Details

  • Supports single-node multi-GPU (DataParallel, DistributedDataParallel) and multi-node multi-GPU training.
  • Integrates with Slurm for cluster job submission.
  • Provides baseline performance metrics for ResNet18, EfficientNet, and RegNet models trained with specific configurations.
  • Addresses the "zombie processes" issue common in older PyTorch versions.
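For the Slurm integration mentioned above, a common pattern (shown here as a hedged sketch, not necessarily distribuuuu's exact code) is to derive PyTorch's rank conventions from the environment variables `srun` sets for each task:

```python
# Sketch: mapping Slurm's per-task environment variables onto PyTorch's
# rank/world_size conventions. A common pattern for srun-launched jobs;
# hypothetical, not necessarily how distribuuuu implements it.
import os

def slurm_dist_info(env=os.environ):
    """Return (rank, world_size, local_rank) for the current Slurm task."""
    rank = int(env["SLURM_PROCID"])        # global rank of this task
    world_size = int(env["SLURM_NTASKS"])  # total number of tasks in the job
    local_rank = int(env["SLURM_LOCALID"])  # task's rank within its node
    return rank, world_size, local_rank

# Example with a fake environment, as srun would set it for task 3 of 8:
fake = {"SLURM_PROCID": "3", "SLURM_NTASKS": "8", "SLURM_LOCALID": "3"}
print(slurm_dist_info(fake))  # (3, 8, 3)
```

These values would then feed `torch.distributed.init_process_group`, with `local_rank` selecting the GPU on each node.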

Maintenance & Community

The project is maintained by Wei Li, with contributions welcomed via pull requests. Issues are accepted for suggestions or bug reports.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project requires PyTorch >= 1.6, and the fix for the zombie-process issue applies only on PyTorch >= 1.8. The absence of an explicit license may restrict commercial or closed-source use.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 90 days

Explore Similar Projects

Starred by Jeff Hammerbacher (Cofounder of Cloudera) and Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake).

InternEvo by InternLM

Top 1.0% · 402 stars
Lightweight training framework for model pre-training
created 1 year ago
updated 1 week ago
Starred by Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake) and Travis Fischer (Founder of Agentic).

lingua by facebookresearch

Top 0.1% · 5k stars
LLM research codebase for training and inference
created 9 months ago
updated 2 weeks ago
Starred by Aravind Srinivas (Cofounder of Perplexity), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 8 more.

higgsfield by higgsfield-ai

Top 0.3% · 3k stars
ML framework for large model training and GPU orchestration
created 7 years ago
updated 1 year ago
Starred by Jeff Hammerbacher (Cofounder of Cloudera), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 2 more.

serve by pytorch

Top 0.1% · 4k stars
Serve, optimize, and scale PyTorch models in production
created 5 years ago
updated 3 weeks ago