diffusion-pipe by tdrussell

Pipeline parallel training script for diffusion models

created 1 year ago
1,323 stars

Top 31.0% on sourcepulse

Project Summary

This project provides a pipeline-parallel training script for diffusion models, targeting researchers and practitioners needing to train large models that exceed single-GPU memory. It offers efficient multi-GPU training with features like checkpointing, pre-caching, and unified support for image and video models, simplifying the process of training advanced generative AI.

How It Works

The script leverages DeepSpeed's pipeline parallelism to partition model layers across multiple GPUs, enabling training of models too large for a single device. It combines data and pipeline parallelism in a hybrid scheme, allowing flexible configuration of how the model is distributed. A key optimization is pre-caching latents and text embeddings to disk, which frees VRAM because the VAE and text encoders do not need to stay on the GPU during training.
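As a rough illustration of the underlying mechanism (not diffusion-pipe's actual code; the layers, stage count, and config values below are placeholders), DeepSpeed's PipelineModule takes an ordered list of layers and splits it into stages, one per GPU:

```python
# Minimal pipeline-parallel sketch with DeepSpeed (illustrative placeholders;
# requires launching with the deepspeed launcher on a multi-GPU node).
import torch.nn as nn
import deepspeed
from deepspeed.pipe import PipelineModule

# PipelineModule partitions this ordered list of layers across GPUs.
layers = [nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)]

model = PipelineModule(
    layers=layers,
    num_stages=2,            # split the layer list across 2 GPUs
    loss_fn=nn.MSELoss(),
)

engine, _, _, _ = deepspeed.initialize(
    model=model,
    config={
        "train_micro_batch_size_per_gpu": 1,
        "gradient_accumulation_steps": 4,  # micro-batches kept in flight per step
        "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    },
)

# engine.train_batch(data_iter) would then run the pipelined forward/backward pass.
```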

Quick Start & Requirements

  • Install: Clone the repo with submodules (git clone --recurse-submodules), create a Conda environment (conda create -n diffusion-pipe python=3.12), activate it (conda activate diffusion-pipe), and install dependencies (pip install -r requirements.txt).
  • Prerequisites: Python 3.12, a CUDA toolkit matching the installed PyTorch build, GCC 12 (for TransformerEngine), and cuDNN. TransformerEngine is required for Cosmos.
  • Setup: Requires careful environment setup, especially for TransformerEngine; a quick sanity check is sketched after this list.
  • Docs: Supported Models
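A hypothetical post-install sanity check (not part of the repo) covers the pieces that most often go wrong:

```python
# Hypothetical environment check: verifies Python, PyTorch/CUDA, and the
# optional TransformerEngine install (only needed for Cosmos).
import sys
import torch

print("Python:", sys.version.split()[0])                       # expect 3.12.x
print("PyTorch:", torch.__version__, "| CUDA build:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())

try:
    import transformer_engine  # noqa: F401
    print("TransformerEngine import OK")
except ImportError:
    print("TransformerEngine not installed (only needed for Cosmos)")
```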

Highlighted Details

  • Supports a wide range of models including SDXL, Flux, LTX-Video, HunyuanVideo, Cosmos, Lumina, Wan, and Chroma.
  • Features block swapping and NF4 quantization for significantly reduced VRAM usage, enabling LoRA training on a single RTX 4090.
  • Offers unified support for both image and video models, with flexible configuration via TOML files.
  • Includes efficient multi-process, multi-GPU pre-caching of latents and text embeddings (sketched below).
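As a loose sketch of the pre-caching idea (the VAE checkpoint, file layout, and resolution below are assumptions, not diffusion-pipe's actual cache format): each image is encoded once, the latent is written to disk, and the VAE is then dropped from VRAM before training starts.

```python
# Illustrative latent pre-caching sketch (assumed checkpoint and file layout).
from pathlib import Path
import torch
from diffusers import AutoencoderKL

device = "cuda"
vae = AutoencoderKL.from_pretrained(
    "stabilityai/sdxl-vae", torch_dtype=torch.float16  # assumed VAE checkpoint
).to(device).eval()

cache_dir = Path("latent_cache")
cache_dir.mkdir(exist_ok=True)

@torch.no_grad()
def cache_latent(name: str, pixels: torch.Tensor) -> None:
    # pixels: (1, 3, H, W) scaled to [-1, 1]
    latent = vae.encode(pixels.to(device, torch.float16)).latent_dist.sample()
    torch.save((latent * vae.config.scaling_factor).cpu(), cache_dir / f"{name}.pt")

# A random tensor stands in for a real preprocessed 1024x1024 image.
cache_latent("sample_0000", torch.rand(1, 3, 1024, 1024) * 2 - 1)

# With all latents (and text embeddings) cached, the VAE can be deleted so its
# VRAM is free for the diffusion transformer during training.
del vae
torch.cuda.empty_cache()
```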

Maintenance & Community

The author notes this is a side project with limited development time. Recent updates include community pull requests (PRs) adding new models and features.

Licensing & Compatibility

The repository does not explicitly state a license in the README.

Limitations & Caveats

Native Windows support is difficult, if not impossible, due to DeepSpeed's limited Windows compatibility; WSL 2 is recommended instead. Because text embeddings are pre-cached, text encoder LoRA training is not currently supported. Resuming training requires passing the original config file on the command line.

Health Check

  • Last commit: 1 day ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 4
  • Issues (30d): 34
  • Star History: 379 stars in the last 90 days

Explore Similar Projects

Starred by Jeremy Howard (Cofounder of fast.ai) and Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake).

SwissArmyTransformer by THUDM
Transformer library for flexible model development
1k stars · created 3 years ago · updated 7 months ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Zhuohan Li (Author of vLLM), and 6 more.

torchtitan by pytorch
PyTorch platform for generative AI model training research
4k stars · created 1 year ago · updated 1 day ago