diffusers-torchao by sayakpaul

Recipes for optimizing diffusion models with torchao and diffusers

Created 1 year ago · 370 stars · Top 77.5% on sourcepulse

Project Summary

This repository provides end-to-end recipes for optimizing diffusion models using torchao and Hugging Face diffusers, enabling faster inference and experimental FP8 training. It targets researchers and engineers working with large diffusion models who need to reduce computational costs and latency. The primary benefit is significant speedups and memory savings through quantization and compilation.

How It Works

The project leverages torchao for quantization (e.g., INT8, FP8, FP6, FP4) and torch.compile() for graph optimization, and demonstrates how to apply these techniques to popular diffusion models such as Flux and CogVideoX. torchao's quantization is applied directly to modules inside diffusers pipelines, giving fine-grained control over quantization schemes and compilation modes to trade off speed, memory use, and output quality.

Quick Start & Requirements

  • Install: pip install -e . (from the cloned repository)
  • Prerequisites: PyTorch nightly, diffusers nightly, torchao nightly, CUDA 12.2+. Experiments were conducted on NVIDIA A100 and H100 GPUs.
  • Setup: Requires cloning the repository and installing dependencies.
  • More Info: Diffusers Documentation, TorchAO Documentation
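Putting the bullets together, a setup might look like the following. The repository URL is inferred from the project name and author, and the nightly install commands are generic PyTorch/diffusers/torchao patterns rather than lines from the README; adjust the CUDA tag to your driver.

```shell
# Nightly PyTorch (CUDA 12.2+ required by the recipes)
pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu121

# diffusers and torchao from source (nightly)
pip install git+https://github.com/huggingface/diffusers
pip install git+https://github.com/pytorch/ao

# Clone and install the recipes themselves
git clone https://github.com/sayakpaul/diffusers-torchao
cd diffusers-torchao
pip install -e .
```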

Highlighted Details

  • Achieves up to 53.88% speedup on Flux.1-Dev and 33.04% on CogVideoX-5b on H100 GPUs compared to standard bf16.
  • Demonstrates significant memory reduction, e.g., CogVideoX-5b requiring ~10.3 GB model memory with INT8 weight-only quantization vs. ~19.7 GB for bf16.
  • Explores various quantization dtypes (INT8, FP8, FP6, FP4) and their impact on speed, memory, and quality.
  • Integrates torch.compile() for further performance gains, including strategies to avoid graph breaks.

Maintenance & Community

  • Actively developed with contributions acknowledged from the PyTorch team.
  • torchao is being integrated as an official quantization backend in diffusers.

Licensing & Compatibility

  • The repository itself does not explicitly state a license in the README. diffusers is typically Apache 2.0 licensed, and torchao is typically BSD 3-Clause licensed. Compatibility for commercial use is likely, but specific terms should be verified.

Limitations & Caveats

  • Experimental FP8 training is mentioned but not detailed.
  • Semi-structured sparsity with INT8 dynamic quantization can significantly degrade image quality.
  • Quantization support is best on Ampere and newer architectures; Turing/Volta and Apple MPS backends may have issues or offer no benefits.
  • Benchmarking scripts can be time-consuming due to compilation and warmup runs.
Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 27 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Jaret Burkett (founder of Ostris), and 1 more.

nunchaku by nunchaku-tech — high-performance 4-bit diffusion model inference engine. 3k stars; created 8 months ago; updated 17 hours ago.