stable-fast  by chengzeyi

Inference optimization framework for HuggingFace Diffusers

created 1 year ago
1,280 stars

Top 31.7% on sourcepulse

Project Summary

This project provides an ultra-lightweight inference optimization framework for HuggingFace Diffusers on NVIDIA GPUs, targeting users who need to maximize inference speed and efficiency. It offers significantly faster compilation times than alternatives like TensorRT or torch.compile, while supporting dynamic shapes, LoRA, and ControlNet out of the box.

How It Works

Stable-fast employs several key techniques to achieve its performance gains. These include CUDNN convolution fusion, low-precision fused GEMM operations, fused GEGLU kernels, and optimized NHWC GroupNorm implemented with OpenAI's Triton. It also leverages CUDA Graphs to reduce CPU overhead for small batch sizes while still supporting dynamic shapes, and integrates xformers to provide fused multi-head attention that remains compatible with TorchScript. The framework minimizes overhead by acting as a plugin for PyTorch, enhancing existing functionality rather than replacing it.
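To make the CUDA Graphs point concrete, here is a minimal, plain-Python sketch (no GPU required) of the general capture-and-replay idea: pay the expensive setup cost once per distinct input shape, then replay the captured work on subsequent calls. The names and the shape key below are illustrative assumptions, not stable-fast's internal API.

```python
# Conceptual sketch: a per-shape "capture" cache, the same pattern CUDA
# Graphs use to cut per-call CPU overhead while handling dynamic shapes.
def make_graph_cached(fn):
    captured = {}        # input shape -> captured (replayable) callable
    capture_count = [0]  # how many times the expensive capture path ran

    def wrapper(batch):
        shape = len(batch)             # stand-in for a tensor shape key
        if shape not in captured:
            capture_count[0] += 1      # expensive path: record the work once
            captured[shape] = fn       # real code would store a replayable graph
        return captured[shape](batch)  # cheap path: replay with new inputs

    wrapper.capture_count = capture_count
    return wrapper

# Usage: repeated calls with the same shape only pay the capture cost once.
double = make_graph_cached(lambda xs: [2 * x for x in xs])
double([1, 2, 3])  # shape 3: capture, then run
double([4, 5, 6])  # shape 3 again: replay only
double([7])        # shape 1: a new capture
assert double.capture_count[0] == 2
```

This is also why CUDA Graphs help most with small batches: the fixed CPU cost of launching each kernel dominates the (small) GPU work, so replaying a pre-recorded graph removes most of the per-call overhead.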

Quick Start & Requirements

  • Installation: Prebuilt wheels are available for Linux and Windows. Install the dependencies with pip3 install --index-url https://download.pytorch.org/whl/cu121 'torch>=2.1.0' 'xformers>=0.0.22' 'triton>=2.1.0' 'diffusers>=0.19.3'. Building from source requires CUDNN/CUBLAS and, optionally, Ninja.
  • Prerequisites: PyTorch with CUDA support (versions 1.12 through 2.1 tested, 2.1.0 recommended); xformers (>=0.0.22) and Triton (>=2.1.0) are recommended for best performance. Testing was done with CUDA 12.1 and Python 3.10.
  • Resources: Compilation is reported to take seconds, significantly faster than other methods.
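Once the dependencies and a prebuilt wheel are installed, compiling a pipeline follows the pattern in the project's README: build a `CompilationConfig`, enable the optional backends, and wrap the pipeline with `compile`. The sketch below assumes the README's module path (`sfast.compilers.diffusion_pipeline_compiler`) and requires an NVIDIA GPU; the model ID is an example.

```python
import torch
from diffusers import StableDiffusionPipeline
from sfast.compilers.diffusion_pipeline_compiler import (compile,
                                                         CompilationConfig)

# Load an ordinary Diffusers pipeline (example model ID).
pipe = StableDiffusionPipeline.from_pretrained(
    'runwayml/stable-diffusion-v1-5', torch_dtype=torch.float16)
pipe.to(torch.device('cuda'))

# Enable the optional acceleration backends when they are installed.
config = CompilationConfig.Default()
config.enable_xformers = True    # fused multi-head attention
config.enable_triton = True      # Triton kernels (fused GEGLU, GroupNorm, ...)
config.enable_cuda_graph = True  # capture/replay to cut CPU overhead
pipe = compile(pipe, config)

# The compiled pipeline is used exactly like the original one.
image = pipe(prompt='a small cedar cabin, watercolor').images[0]
```

The first call triggers compilation (reported to take seconds); later calls with the same shapes run on the optimized path.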

Highlighted Details

  • Achieves SOTA inference performance on a range of Diffusers pipelines, including StableVideoDiffusionPipeline.
  • Supports dynamic shape, LoRA, and ControlNet natively.
  • Offers significantly faster compilation times (seconds) compared to TensorRT or AITemplate (minutes).
  • Benchmarks show substantial speedups over vanilla PyTorch and torch.compile on RTX 4080, H100, and A100 GPUs.

Maintenance & Community

Active development on stable-fast has been paused, with the author focusing on a new torch.dynamo-based project for newer models and broader hardware support. A Discord channel is available for community support.

Licensing & Compatibility

The project appears to be under a permissive license, though specific details are not explicitly stated in the README. It is compatible with various Hugging Face Diffusers versions, ControlNet, LoRA, LCM, SDXL Turbo, and Stable Video Diffusion.

Limitations & Caveats

The project's active development has been paused in favor of a new project. Compatibility with PyTorch versions outside the tested range (1.12 through 2.1) is not guaranteed. Progress bar accuracy may be affected by CUDA's asynchronous execution.

Health Check

  • Last commit: 4 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 32 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jaret Burkett (Founder of Ostris), and 1 more.

nunchaku by nunchaku-tech
3k stars (top 2.1%)
High-performance 4-bit diffusion model inference engine
created 8 months ago, updated 14 hours ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Zhuohan Li (Author of vLLM), and 6 more.

torchtitan by pytorch
4k stars (top 0.9%)
PyTorch platform for generative AI model training research
created 1 year ago, updated 22 hours ago
Starred by Nat Friedman (Former CEO of GitHub), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 6 more.

FasterTransformer by NVIDIA
6k stars (top 0.2%)
Optimized transformer library for inference
created 4 years ago, updated 1 year ago