distribuuuu by BIGBALLON

PyTorch distributed training framework

created 4 years ago
275 stars

Top 94.9% on sourcepulse

Project Summary

Distribuuuu provides a streamlined framework for distributed PyTorch training, targeting researchers and engineers who need to scale model training across multiple GPUs and nodes. It simplifies the setup and execution of distributed training jobs, offering clear examples for various configurations and integrating with cluster schedulers like Slurm.

How It Works

The framework leverages native PyTorch distributed functionalities, including torch.distributed.launch and torch.multiprocessing, to manage multi-GPU and multi-node training. It utilizes yacs for flexible configuration management, allowing users to define hyperparameters and training settings via YAML files that can be easily overridden. This approach aims for clarity and ease of use in complex distributed setups.
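The YAML-plus-override workflow described above typically looks like the following fragment. The key names here are illustrative of the yacs style, not necessarily distribuuuu's actual schema:

```yaml
# Hypothetical training config in the yacs style described above;
# key names are illustrative, not distribuuuu's actual schema.
MODEL:
  TYPE: resnet18
  NUM_CLASSES: 1000
TRAIN:
  DATASET: imagenet
  BATCH_SIZE: 256
  BASE_LR: 0.1
  MAX_EPOCH: 90
OPTIM:
  MOMENTUM: 0.9
  WEIGHT_DECAY: 0.0001
```

With yacs, individual keys can then be overridden at launch time (e.g. appending `TRAIN.BASE_LR 0.2` to the command line) without editing the file.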

Quick Start & Requirements

  • Install PyTorch (>= 1.6) and other dependencies via pip install -r requirements.txt.
  • Requires ImageNet dataset prepared with a specific directory structure and symlinks.
  • Basic usage involves python -m torch.distributed.launch with appropriate arguments for nodes, GPUs, and configuration files.
  • Official tutorials are available for detailed guidance.
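The launch step above can be sketched as follows. This builds the standard `torch.distributed.launch` argument vector for a two-node, eight-GPU-per-node run; the script name `train_net.py` and config path are hypothetical stand-ins for the repo's actual entry point:

```python
# Sketch: assembling a torch.distributed.launch invocation.
# Script and config names are hypothetical; substitute the repo's own.
import sys

def launch_cmd(nnodes, node_rank, nproc_per_node, master_addr,
               master_port, script, cfg):
    """Build the argv for a multi-node torch.distributed.launch run."""
    return [
        sys.executable, "-m", "torch.distributed.launch",
        f"--nnodes={nnodes}",
        f"--node_rank={node_rank}",          # rank of this machine, 0-based
        f"--nproc_per_node={nproc_per_node}",  # usually one process per GPU
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
        script, "--cfg", cfg,
    ]

# Node 0 of a 2-node x 8-GPU job:
cmd = launch_cmd(2, 0, 8, "10.0.0.1", 29500,
                 "train_net.py", "config/resnet18.yaml")
print(" ".join(cmd))
```

The same command is run on every node with only `--node_rank` changed; the launcher then spawns one worker process per GPU.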

Highlighted Details

  • Supports single-node multi-GPU (DataParallel, DistributedDataParallel) and multi-node multi-GPU training.
  • Integrates with Slurm for cluster job submission.
  • Provides baseline performance metrics for ResNet18, EfficientNet, and RegNet models trained with specific configurations.
  • Addresses the "zombie processes" issue common in older PyTorch versions.
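For the Slurm integration mentioned above, a common pattern (shown here as a hedged sketch, not necessarily distribuuuu's exact code) is to derive PyTorch's rank conventions from the environment variables `srun` sets for each task:

```python
# Sketch: mapping Slurm's per-task environment variables onto PyTorch's
# rank/world_size conventions. A common pattern for srun-launched jobs;
# hypothetical, not necessarily how distribuuuu implements it.
import os

def slurm_dist_info(env=os.environ):
    """Return (rank, world_size, local_rank) for the current Slurm task."""
    rank = int(env["SLURM_PROCID"])        # global rank of this task
    world_size = int(env["SLURM_NTASKS"])  # total number of tasks in the job
    local_rank = int(env["SLURM_LOCALID"])  # task's rank within its node
    return rank, world_size, local_rank

# Example with a fake environment, as srun would set it for task 3 of 8:
fake = {"SLURM_PROCID": "3", "SLURM_NTASKS": "8", "SLURM_LOCALID": "3"}
print(slurm_dist_info(fake))  # (3, 8, 3)
```

These values would then feed `torch.distributed.init_process_group`, with `local_rank` selecting the GPU on each node.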

Maintenance & Community

The project is maintained by Wei Li, with contributions welcomed via pull requests. Issues are accepted for suggestions or bug reports.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project requires PyTorch >= 1.6, and the fix for the zombie-process issue applies only on PyTorch >= 1.8. The absence of an explicit license may restrict commercial or closed-source use.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 90 days

Explore Similar Projects

Starred by Jeff Hammerbacher (Cofounder of Cloudera) and Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake).

InternEvo by InternLM

Top 1.0% · 402 stars
Lightweight training framework for model pre-training
created 1 year ago
updated 1 week ago
Starred by Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake) and Travis Fischer (Founder of Agentic).

lingua by facebookresearch

Top 0.1% · 5k stars
LLM research codebase for training and inference
created 9 months ago
updated 2 weeks ago
Starred by Aravind Srinivas (Cofounder of Perplexity), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 8 more.

higgsfield by higgsfield-ai

Top 0.3% · 3k stars
ML framework for large model training and GPU orchestration
created 7 years ago
updated 1 year ago
Starred by Jeff Hammerbacher (Cofounder of Cloudera), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 2 more.

serve by pytorch

Top 0.1% · 4k stars
Serve, optimize, and scale PyTorch models in production
created 5 years ago
updated 3 weeks ago