stable-fast  by chengzeyi

Inference optimization framework for HuggingFace Diffusers

created 1 year ago
1,280 stars

Top 31.7% on sourcepulse

Project Summary

This project provides an ultra-lightweight inference optimization framework for HuggingFace Diffusers on NVIDIA GPUs, targeting users who need to maximize inference speed and efficiency. It offers significantly faster compilation times than alternatives like TensorRT or torch.compile, while supporting dynamic shapes, LoRA, and ControlNet out of the box.

How It Works

Stable-fast employs several key techniques to achieve its performance gains. These include CUDNN convolution fusion, low-precision fused GEMM operations, fused GEGLU kernels, and optimized NHWC GroupNorm implemented with OpenAI's Triton. It also leverages CUDA Graphs to reduce CPU overhead for small batch sizes while still supporting dynamic shapes, and integrates xformers to provide fused multi-head attention that remains compatible with TorchScript. The framework minimizes overhead by acting as a plugin for PyTorch, enhancing existing functionality rather than replacing it.
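To make the CUDA Graphs point concrete, here is a minimal, plain-Python sketch (no GPU required) of the general capture-and-replay idea: pay the expensive setup cost once per distinct input shape, then replay the captured work on subsequent calls. The names and the shape key below are illustrative assumptions, not stable-fast's internal API.

```python
# Conceptual sketch: a per-shape "capture" cache, the same pattern CUDA
# Graphs use to cut per-call CPU overhead while handling dynamic shapes.
def make_graph_cached(fn):
    captured = {}        # input shape -> captured (replayable) callable
    capture_count = [0]  # how many times the expensive capture path ran

    def wrapper(batch):
        shape = len(batch)             # stand-in for a tensor shape key
        if shape not in captured:
            capture_count[0] += 1      # expensive path: record the work once
            captured[shape] = fn       # real code would store a replayable graph
        return captured[shape](batch)  # cheap path: replay with new inputs

    wrapper.capture_count = capture_count
    return wrapper

# Usage: repeated calls with the same shape only pay the capture cost once.
double = make_graph_cached(lambda xs: [2 * x for x in xs])
double([1, 2, 3])  # shape 3: capture, then run
double([4, 5, 6])  # shape 3 again: replay only
double([7])        # shape 1: a new capture
assert double.capture_count[0] == 2
```

This is also why CUDA Graphs help most with small batches: the fixed CPU cost of launching each kernel dominates the (small) GPU work, so replaying a pre-recorded graph removes most of the per-call overhead.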

Quick Start & Requirements

  • Installation: Prebuilt wheels are available for Linux and Windows. Install the dependencies with pip3 install --index-url https://download.pytorch.org/whl/cu121 'torch>=2.1.0' 'xformers>=0.0.22' 'triton>=2.1.0' 'diffusers>=0.19.3'. Building from source requires CUDNN/CUBLAS and, optionally, Ninja.
  • Prerequisites: PyTorch with CUDA support (versions 1.12 through 2.1 tested, 2.1.0 recommended); xformers (>=0.0.22) and Triton (>=2.1.0) are recommended for best performance. Testing was done with CUDA 12.1 and Python 3.10.
  • Resources: Compilation is reported to take seconds, significantly faster than other methods.
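Once the dependencies and a prebuilt wheel are installed, compiling a pipeline follows the pattern in the project's README: build a `CompilationConfig`, enable the optional backends, and wrap the pipeline with `compile`. The sketch below assumes the README's module path (`sfast.compilers.diffusion_pipeline_compiler`) and requires an NVIDIA GPU; the model ID is an example.

```python
import torch
from diffusers import StableDiffusionPipeline
from sfast.compilers.diffusion_pipeline_compiler import (compile,
                                                         CompilationConfig)

# Load an ordinary Diffusers pipeline (example model ID).
pipe = StableDiffusionPipeline.from_pretrained(
    'runwayml/stable-diffusion-v1-5', torch_dtype=torch.float16)
pipe.to(torch.device('cuda'))

# Enable the optional acceleration backends when they are installed.
config = CompilationConfig.Default()
config.enable_xformers = True    # fused multi-head attention
config.enable_triton = True      # Triton kernels (fused GEGLU, GroupNorm, ...)
config.enable_cuda_graph = True  # capture/replay to cut CPU overhead
pipe = compile(pipe, config)

# The compiled pipeline is used exactly like the original one.
image = pipe(prompt='a small cedar cabin, watercolor').images[0]
```

The first call triggers compilation (reported to take seconds); later calls with the same shapes run on the optimized path.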

Highlighted Details

  • Achieves SOTA inference performance on a range of Diffusers pipelines, including StableVideoDiffusionPipeline.
  • Supports dynamic shape, LoRA, and ControlNet natively.
  • Offers significantly faster compilation times (seconds) compared to TensorRT or AITemplate (minutes).
  • Benchmarks show substantial speedups over vanilla PyTorch and torch.compile on RTX 4080, H100, and A100 GPUs.

Maintenance & Community

Active development on stable-fast has been paused, with the author focusing on a new torch.dynamo-based project for newer models and broader hardware support. A Discord channel is available for community support.

Licensing & Compatibility

The project appears to be under a permissive license, though specific details are not explicitly stated in the README. It is compatible with various Hugging Face Diffusers versions, ControlNet, LoRA, LCM, SDXL Turbo, and Stable Video Diffusion.

Limitations & Caveats

The project's active development has been paused in favor of a new project. Compatibility with PyTorch versions outside the tested range (1.12 through 2.1) is not guaranteed. Progress bar accuracy may be affected by CUDA's asynchronous execution.

Health Check

  • Last commit: 4 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 32 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jaret Burkett (Founder of Ostris), and 1 more.

nunchaku by nunchaku-tech
3k stars (top 2.1%)
High-performance 4-bit diffusion model inference engine
created 8 months ago, updated 14 hours ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Zhuohan Li (Author of vLLM), and 6 more.

torchtitan by pytorch
4k stars (top 0.9%)
PyTorch platform for generative AI model training research
created 1 year ago, updated 22 hours ago
Starred by Nat Friedman (Former CEO of GitHub), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 6 more.

FasterTransformer by NVIDIA
6k stars (top 0.2%)
Optimized transformer library for inference
created 4 years ago, updated 1 year ago