diffusion-pipe by tdrussell

Pipeline parallel training script for diffusion models

created 1 year ago
1,323 stars

Top 31.0% on sourcepulse

Project Summary

This project provides a pipeline-parallel training script for diffusion models, targeting researchers and practitioners needing to train large models that exceed single-GPU memory. It offers efficient multi-GPU training with features like checkpointing, pre-caching, and unified support for image and video models, simplifying the process of training advanced generative AI.

How It Works

The script leverages DeepSpeed's pipeline parallelism to partition model layers across multiple GPUs, enabling training of models too large for a single device. It combines data and pipeline parallelism in a hybrid scheme, allowing flexible configuration of how the model is distributed. A key optimization is pre-caching latents and text embeddings to disk, which frees VRAM because the VAE and text encoders do not need to stay on the GPU during training.
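As a rough illustration of the underlying mechanism (not diffusion-pipe's actual code; the layers, stage count, and config values below are placeholders), DeepSpeed's PipelineModule takes an ordered list of layers and splits it into stages, one per GPU:

```python
# Minimal pipeline-parallel sketch with DeepSpeed (illustrative placeholders;
# requires launching with the deepspeed launcher on a multi-GPU node).
import torch.nn as nn
import deepspeed
from deepspeed.pipe import PipelineModule

# PipelineModule partitions this ordered list of layers across GPUs.
layers = [nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)]

model = PipelineModule(
    layers=layers,
    num_stages=2,            # split the layer list across 2 GPUs
    loss_fn=nn.MSELoss(),
)

engine, _, _, _ = deepspeed.initialize(
    model=model,
    config={
        "train_micro_batch_size_per_gpu": 1,
        "gradient_accumulation_steps": 4,  # micro-batches kept in flight per step
        "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    },
)

# engine.train_batch(data_iter) would then run the pipelined forward/backward pass.
```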

Quick Start & Requirements

  • Install: Clone the repo with submodules (git clone --recurse-submodules), create a Conda environment (conda create -n diffusion-pipe python=3.12), activate it (conda activate diffusion-pipe), and install dependencies (pip install -r requirements.txt).
  • Prerequisites: Python 3.12, a CUDA toolkit matching the installed PyTorch build, GCC 12 (for TransformerEngine), and cuDNN. TransformerEngine is required for Cosmos.
  • Setup: Requires careful environment setup, especially for TransformerEngine; a quick sanity check is sketched after this list.
  • Docs: Supported Models
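A hypothetical post-install sanity check (not part of the repo) covers the pieces that most often go wrong:

```python
# Hypothetical environment check: verifies Python, PyTorch/CUDA, and the
# optional TransformerEngine install (only needed for Cosmos).
import sys
import torch

print("Python:", sys.version.split()[0])                       # expect 3.12.x
print("PyTorch:", torch.__version__, "| CUDA build:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())

try:
    import transformer_engine  # noqa: F401
    print("TransformerEngine import OK")
except ImportError:
    print("TransformerEngine not installed (only needed for Cosmos)")
```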

Highlighted Details

  • Supports a wide range of models including SDXL, Flux, LTX-Video, HunyuanVideo, Cosmos, Lumina, Wan, and Chroma.
  • Features block swapping and NF4 quantization for significantly reduced VRAM usage, enabling LoRA training on a single RTX 4090.
  • Offers unified support for both image and video models, with flexible configuration via TOML files.
  • Includes efficient multi-process, multi-GPU pre-caching of latents and text embeddings (sketched below).
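As a loose sketch of the pre-caching idea (the VAE checkpoint, file layout, and resolution below are assumptions, not diffusion-pipe's actual cache format): each image is encoded once, the latent is written to disk, and the VAE is then dropped from VRAM before training starts.

```python
# Illustrative latent pre-caching sketch (assumed checkpoint and file layout).
from pathlib import Path
import torch
from diffusers import AutoencoderKL

device = "cuda"
vae = AutoencoderKL.from_pretrained(
    "stabilityai/sdxl-vae", torch_dtype=torch.float16  # assumed VAE checkpoint
).to(device).eval()

cache_dir = Path("latent_cache")
cache_dir.mkdir(exist_ok=True)

@torch.no_grad()
def cache_latent(name: str, pixels: torch.Tensor) -> None:
    # pixels: (1, 3, H, W) scaled to [-1, 1]
    latent = vae.encode(pixels.to(device, torch.float16)).latent_dist.sample()
    torch.save((latent * vae.config.scaling_factor).cpu(), cache_dir / f"{name}.pt")

# A random tensor stands in for a real preprocessed 1024x1024 image.
cache_latent("sample_0000", torch.rand(1, 3, 1024, 1024) * 2 - 1)

# With all latents (and text embeddings) cached, the VAE can be deleted so its
# VRAM is free for the diffusion transformer during training.
del vae
torch.cuda.empty_cache()
```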

Maintenance & Community

The author notes this is a side project with limited development time. Recent updates include community pull requests (PRs) adding new models and features.

Licensing & Compatibility

The repository does not explicitly state a license in the README.

Limitations & Caveats

Native Windows support is difficult, if not impossible, due to DeepSpeed's limited Windows compatibility; WSL 2 is recommended instead. Because text embeddings are pre-cached, text encoder LoRA training is not currently supported. Resuming training requires passing the original config file on the command line.

Health Check

  • Last commit: 1 day ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 4
  • Issues (30d): 34
  • Star History: 379 stars in the last 90 days

Explore Similar Projects

Starred by Jeremy Howard (Cofounder of fast.ai) and Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake).

SwissArmyTransformer by THUDM
Transformer library for flexible model development
1k stars · created 3 years ago · updated 7 months ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Zhuohan Li (Author of vLLM), and 6 more.

torchtitan by pytorch
PyTorch platform for generative AI model training research
4k stars · created 1 year ago · updated 1 day ago