PyTorch distributed training framework
Distribuuuu provides a streamlined framework for distributed PyTorch training, targeting researchers and engineers who need to scale model training across multiple GPUs and nodes. It simplifies the setup and execution of distributed training jobs, offering clear examples for various configurations and integrating with cluster schedulers like Slurm.
How It Works
The framework leverages native PyTorch distributed functionality, including torch.distributed.launch and torch.multiprocessing, to manage multi-GPU and multi-node training. It uses yacs for flexible configuration management, allowing users to define hyperparameters and training settings in YAML files whose values can be easily overridden. This approach aims for clarity and ease of use in complex distributed setups.
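As a rough illustration of the yacs pattern described above, the sketch below builds a set of defaults and then merges in a YAML file and command-line overrides. The config keys, file name, and get_cfg helper are hypothetical assumptions for illustration, not Distribuuuu's actual schema.

```python
# Minimal yacs configuration sketch. The keys (TRAIN.*, OPTIM.*), the YAML
# file name, and get_cfg() are illustrative assumptions, not Distribuuuu's
# real configuration schema.
from yacs.config import CfgNode as CN

_C = CN()
_C.TRAIN = CN()
_C.TRAIN.BATCH_SIZE = 256   # per-process batch size
_C.TRAIN.EPOCHS = 90
_C.OPTIM = CN()
_C.OPTIM.LR = 0.1

def get_cfg(yaml_file=None, overrides=None):
    """Clone the defaults, then merge a YAML file and a list of CLI overrides."""
    cfg = _C.clone()
    if yaml_file:
        cfg.merge_from_file(yaml_file)   # e.g. a per-experiment YAML file
    if overrides:
        cfg.merge_from_list(overrides)   # e.g. ["OPTIM.LR", "0.05"]
    cfg.freeze()                         # make the config read-only
    return cfg
```

Values from the YAML file override the defaults, and merge_from_list lets a launch script override individual fields from the command line without editing any file.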
Quick Start & Requirements
Install dependencies with pip install -r requirements.txt. Training is then launched with python -m torch.distributed.launch, passing appropriate arguments for nodes, GPUs, and configuration files, as sketched below.
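The following is a minimal sketch of an entry script that torch.distributed.launch could invoke, assuming one process per GPU and the NCCL backend. The script name, model, and launch flags shown in the comments are placeholders for illustration, not Distribuuuu's actual trainer.

```python
# train.py -- hypothetical minimal entry point for torch.distributed.launch;
# not Distribuuuu's actual training script.
#
# Example launch (2 nodes x 8 GPUs), run on every node with its own node_rank:
#   python -m torch.distributed.launch --nnodes=2 --node_rank=0 \
#       --nproc_per_node=8 --master_addr=<ip> --master_port=29500 train.py
import argparse

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    parser = argparse.ArgumentParser()
    # torch.distributed.launch passes --local_rank to each spawned process.
    parser.add_argument("--local_rank", type=int, default=0)
    args = parser.parse_args()

    # One process per GPU; NCCL is the usual backend for GPU collectives.
    dist.init_process_group(backend="nccl", init_method="env://")
    torch.cuda.set_device(args.local_rank)

    model = nn.Linear(128, 10).cuda(args.local_rank)
    model = DDP(model, device_ids=[args.local_rank])

    # ... build a DataLoader with a DistributedSampler and run the training loop ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Each node runs the same command with its own node_rank, and the launcher spawns nproc_per_node processes per node, one per GPU.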
Highlighted Details
Maintenance & Community
The project is maintained by Wei Li, with contributions welcomed via pull requests. Issues are accepted for suggestions or bug reports.
Licensing & Compatibility
The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The project requires PyTorch >= 1.6, and the README notes that PyTorch >= 1.8 includes fixes for zombie processes. The absence of an explicit license may restrict certain use cases.
The repository was last updated about a year ago and appears inactive.