stable-fast by chengzeyi

Inference optimization framework for HuggingFace Diffusers

Created 1 year ago
1,287 stars

Top 30.9% on SourcePulse

View on GitHub
Project Summary

This project provides an ultra-lightweight inference optimization framework for HuggingFace Diffusers on NVIDIA GPUs, targeting users who need to maximize inference speed and efficiency. It offers significantly faster compilation times than alternatives like TensorRT or torch.compile, while supporting dynamic shapes, LoRA, and ControlNet out of the box.

How It Works

Stable-fast applies several techniques to achieve its speedups: CUDNN convolution fusion, low-precision fused GEMM operations, fused GEGLU kernels, and an optimized NHWC GroupNorm implemented with OpenAI's Triton. It also uses CUDA Graphs to reduce CPU overhead for small batch sizes and dynamic shapes, and reuses xformers' fused multihead attention while making it compatible with TorchScript. The framework keeps overhead minimal by acting as a plugin for PyTorch, extending existing functionality rather than replacing it.
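As a rough illustration of how these optimizations are exposed to users, the sketch below toggles them through stable-fast's compilation config, following the pattern shown in the project's README; the module path and attribute names are assumptions to verify against your installed version (older releases used a different module name).

```python
# A minimal sketch, not an authoritative API reference: toggle the optimizations
# described above via stable-fast's CompilationConfig (names assumed from the
# project's README; older releases may expose a different module path).
from sfast.compilers.diffusion_pipeline_compiler import CompilationConfig

config = CompilationConfig.Default()
config.enable_xformers = True    # fused multihead attention via xformers
config.enable_triton = True      # Triton kernels, e.g. the NHWC GroupNorm
config.enable_cuda_graph = True  # CUDA Graphs to cut CPU launch overhead
```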

Quick Start & Requirements

  • Installation: Prebuilt wheels are available for Linux and Windows; building from source requires CUDNN/CUBLAS and, optionally, Ninja. Install the PyTorch, xformers, Triton, and Diffusers dependencies with pip3 install --index-url https://download.pytorch.org/whl/cu121 'torch>=2.1.0' 'xformers>=0.0.22' 'triton>=2.1.0' 'diffusers>=0.19.3'. A usage sketch follows this list.
  • Prerequisites: PyTorch with CUDA support (versions 1.12 to 2.1 tested, 2.1.0 recommended); xformers (>=0.0.22) and Triton (>=2.1.0) are recommended for best performance. CUDA 12.1 and Python 3.10 were used for testing.
  • Resources: Compilation is reported to take seconds, significantly faster than other methods.
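For a rough end-to-end sketch of typical usage (the model ID is illustrative and the import path and config attributes follow the pattern in the project's README; verify against your installed version), the first call to the compiled pipeline triggers compilation, which the project reports takes seconds:

```python
import torch
from diffusers import StableDiffusionPipeline
from sfast.compilers.diffusion_pipeline_compiler import compile, CompilationConfig

# Load an fp16 pipeline onto the GPU (the model ID is only illustrative).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Compile the whole pipeline with stable-fast (config names assumed from the README).
config = CompilationConfig.Default()
config.enable_xformers = True
config.enable_triton = True
config.enable_cuda_graph = True
pipe = compile(pipe, config)

# The first call performs the (seconds-long) compilation; later calls reuse it.
image = pipe("an astronaut riding a horse", num_inference_steps=30).images[0]
image.save("out.png")
```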

Highlighted Details

  • Achieves SOTA inference performance on a range of Diffusers models, including StableVideoDiffusionPipeline.
  • Supports dynamic shapes, LoRA, and ControlNet natively (see the dynamic-shape sketch after this list).
  • Offers significantly faster compilation times (seconds) compared to TensorRT or AITemplate (minutes).
  • Benchmarks show substantial speedups over vanilla PyTorch and torch.compile on RTX 4080, H100, and A100 GPUs.
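To illustrate the dynamic-shape point from the list above: a compiled pipeline should accept different output resolutions without recompiling. A small sketch, reusing the hypothetical pipe from the Quick Start example:

```python
# Assumes `pipe` is the stable-fast-compiled pipeline from the Quick Start sketch.
prompt = "a watercolor painting of a lighthouse"
img_a = pipe(prompt, height=512, width=512).images[0]
# A different resolution on the same compiled pipeline: dynamic shape support
# means this should not trigger a full recompilation.
img_b = pipe(prompt, height=768, width=512).images[0]
```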

Maintenance & Community

Active development on stable-fast has been paused, with the author focusing on a new torch.dynamo-based project for newer models and broader hardware support. A Discord channel is available for community support.

Licensing & Compatibility

The project appears to be under a permissive license, though specific details are not explicitly stated in the README. It is compatible with various Hugging Face Diffusers versions, ControlNet, LoRA, LCM, SDXL Turbo, and Stable Video Diffusion.

Limitations & Caveats

Active development has been paused in favor of the author's newer torch.dynamo-based project. Compatibility with PyTorch versions outside the tested range is not guaranteed (2.1.0 is the recommended version). Progress-bar readouts may be inaccurate because CUDA executes kernels asynchronously.
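As a side note on the last caveat: because CUDA launches kernels asynchronously, progress bars and naive timers can run ahead of the GPU. A generic, stable-fast-agnostic way to benchmark (reusing the hypothetical pipe and prompt from the earlier sketches) is to synchronize before reading the clock:

```python
import time
import torch

torch.cuda.synchronize()              # wait for any pending GPU work
start = time.perf_counter()
_ = pipe(prompt, num_inference_steps=30)
torch.cuda.synchronize()              # ensure the GPU has finished before timing
print(f"end-to-end latency: {time.perf_counter() - start:.3f} s")
```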

Health Check

  • Last Commit: 5 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 7 stars in the last 30 days

Starred by Tri Dao (Chief Scientist at Together AI), Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), and 1 more.

Explore Similar Projects

oslo by tunib-ai

309 stars · 0%
Framework for large-scale transformer optimization
Created 3 years ago · Updated 3 years ago
Starred by Tobi Lutke (Cofounder of Shopify), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 11 more.

ctransformers by marella

2k stars · 0.1%
Python bindings for fast Transformer model inference
Created 2 years ago · Updated 1 year ago
Starred by Alex Yu (Research Scientist at OpenAI; Former Cofounder of Luma AI) and Cody Yu (Coauthor of vLLM; MTS at OpenAI).

xDiT by xdit-project

2k stars · 0.7%
Inference engine for parallel Diffusion Transformer (DiT) deployment
Created 1 year ago · Updated 1 day ago
Starred by Nat Friedman (Former CEO of GitHub), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 15 more.

FasterTransformer by NVIDIA

6k stars · 0.1%
Optimized transformer library for inference
Created 4 years ago · Updated 1 year ago
Starred by Aravind Srinivas (Cofounder of Perplexity), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 13 more.

pytorch3d by facebookresearch

10k stars · 0.2%
PyTorch library for 3D deep learning research
Created 5 years ago · Updated 3 days ago