x-stable-diffusion by stochasticai

CLI tool for Stable Diffusion acceleration techniques

created 2 years ago
557 stars

Top 58.4% on sourcepulse

Project Summary

This project provides optimized inference pipelines for Stable Diffusion, targeting users who need to generate images with significantly reduced latency and resource consumption. It offers a command-line interface (CLI) for easy deployment and inference, enabling faster and more cost-effective image generation.

How It Works

The project integrates several key acceleration techniques: Meta's AITemplate, NVIDIA's TensorRT, nvFuser, and FlashAttention (via Xformers). These frameworks are applied to optimize the Stable Diffusion model, aiming for lower latency and reduced VRAM usage. The approach allows users to select the best-performing optimization for their specific hardware and use case, with benchmarks provided for comparison.
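Latency figures like the benchmarks cited below are typically the mean of several timed pipeline runs taken after a few warmup iterations. A minimal timing harness sketch, with the actual Stable Diffusion call stubbed out (`run_pipeline` is a hypothetical placeholder, not part of this project's API):

```python
import time

def benchmark(fn, warmup=2, runs=5):
    """Return mean wall-clock seconds per call, discarding warmup runs."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs

def run_pipeline():
    # Placeholder for a real call such as pipe(prompt, num_inference_steps=30)
    time.sleep(0.01)

mean_s = benchmark(run_pipeline)
print(f"mean latency: {mean_s:.3f}s")
```

Swapping `run_pipeline` for each optimized backend (AITemplate, TensorRT, nvFuser, xFormers) under identical prompts and step counts is the usual way to produce comparable numbers.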

Quick Start & Requirements

  • Install via pip: pip install stochasticx
  • Requires Python and Docker.
  • For optimal performance, an NVIDIA GPU (A100 tested) with CUDA 11.6 is recommended.
  • Official quick-start and deployment guides are available within the repository.
  • Example Colab notebooks are provided for testing on T4 GPUs.

Highlighted Details

  • Achieves 0.88s latency with AITemplate on an A100 GPU using 30 inference steps.
  • Benchmarks show AITemplate achieving 1.38s latency on an A100 with 50 steps, significantly outperforming PyTorch (5.77s).
  • TensorRT integration may face memory issues for model conversion.
  • Supports batch processing, with AITemplate showing efficient scaling up to batch size 24 on an A100.
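From the 50-step A100 figures above, the relative speedup works out to roughly 4.2x:

```python
# Latencies from the A100 / 50-step benchmark reported above.
pytorch_s = 5.77
aitemplate_s = 1.38

speedup = pytorch_s / aitemplate_s
print(f"AITemplate is {speedup:.1f}x faster than stock PyTorch")  # ~4.2x
```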

Licensing & Compatibility

  • The project appears to be open-source, but a specific license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking would require clarification on licensing terms.

Limitations & Caveats

  • AITemplate may not yet support T4 GPUs.
  • TensorRT encountered memory issues during ONNX to TensorRT conversion for the UNet model.
  • Benchmarks are based on specific hardware (A100 GPU, CUDA 11.6) and may vary on other configurations.
Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 1 star in the last 90 days
