x-stable-diffusion by stochasticai

CLI tool for Stable Diffusion acceleration techniques

created 2 years ago
557 stars

Top 58.4% on sourcepulse

Project Summary

This project provides optimized inference pipelines for Stable Diffusion, targeting users who need to generate images with significantly reduced latency and resource consumption. It offers a command-line interface (CLI) for easy deployment and inference, enabling faster and more cost-effective image generation.

How It Works

The project integrates several key acceleration techniques: Meta's AITemplate, NVIDIA's TensorRT, nvFuser, and FlashAttention (via Xformers). These frameworks are applied to optimize the Stable Diffusion model, aiming for lower latency and reduced VRAM usage. The approach allows users to select the best-performing optimization for their specific hardware and use case, with benchmarks provided for comparison.
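Latency figures like the benchmarks cited below are typically the mean of several timed pipeline runs taken after a few warmup iterations. A minimal timing harness sketch, with the actual Stable Diffusion call stubbed out (`run_pipeline` is a hypothetical placeholder, not part of this project's API):

```python
import time

def benchmark(fn, warmup=2, runs=5):
    """Return mean wall-clock seconds per call, discarding warmup runs."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs

def run_pipeline():
    # Placeholder for a real call such as pipe(prompt, num_inference_steps=30)
    time.sleep(0.01)

mean_s = benchmark(run_pipeline)
print(f"mean latency: {mean_s:.3f}s")
```

Swapping `run_pipeline` for each optimized backend (AITemplate, TensorRT, nvFuser, xFormers) under identical prompts and step counts is the usual way to produce comparable numbers.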

Quick Start & Requirements

  • Install via pip: pip install stochasticx
  • Requires Python and Docker.
  • For optimal performance, an NVIDIA GPU (A100 tested) with CUDA 11.6 is recommended.
  • Official quick-start and deployment guides are available within the repository.
  • Example Colab notebooks are provided for testing on T4 GPUs.

Highlighted Details

  • Achieves 0.88s latency with AITemplate on an A100 GPU using 30 inference steps.
  • Benchmarks show AITemplate achieving 1.38s latency on an A100 with 50 steps, significantly outperforming PyTorch (5.77s).
  • TensorRT integration may face memory issues for model conversion.
  • Supports batch processing, with AITemplate showing efficient scaling up to batch size 24 on an A100.
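From the 50-step A100 figures above, the relative speedup works out to roughly 4.2x:

```python
# Latencies from the A100 / 50-step benchmark reported above.
pytorch_s = 5.77
aitemplate_s = 1.38

speedup = pytorch_s / aitemplate_s
print(f"AITemplate is {speedup:.1f}x faster than stock PyTorch")  # ~4.2x
```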

Licensing & Compatibility

  • The project appears to be open-source, but a specific license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking would require clarification on licensing terms.

Limitations & Caveats

  • AITemplate may not yet support T4 GPUs.
  • TensorRT encountered memory issues during ONNX to TensorRT conversion for the UNet model.
  • Benchmarks are based on specific hardware (A100 GPU, CUDA 11.6) and may vary on other configurations.
Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 1 star in the last 90 days
