flux-fp8-api by aredden

FastAPI for text-to-image diffusion using FP8

created 1 year ago
272 stars

Top 95.5% on sourcepulse

Project Summary

This repository provides a FastAPI implementation of the Flux diffusion model, optimized for speed through FP8 matrix multiplication and other quantization techniques. It targets users seeking faster image generation on consumer hardware, offering a ~2x speedup over baseline implementations.

How It Works

The core speedup comes from running the Flux model's matrix multiplications in FP8 precision, which significantly accelerates computation on supported GPUs. Specific model blocks, and optionally additional layers ("extras"), can be compiled for further gains, while the remaining layers use faster half-precision accumulation.
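
As a rough illustration of the technique (not this repository's actual code), the sketch below quantizes both operands of a matmul to float8_e4m3fn with per-tensor scales and multiplies them with PyTorch's scaled-matmul primitive. It assumes PyTorch >= 2.4 on an FP8-capable GPU; torch._scaled_mm is a private API whose signature has changed between releases, so treat this strictly as a sketch under those assumptions.

    import torch

    E4M3_MAX = 448.0  # largest magnitude representable in float8_e4m3fn

    def fp8_matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        """Compute a @ b (a: [M, K], b: [K, N]) in FP8 with per-tensor scales."""
        scale_a = (a.abs().amax().float() / E4M3_MAX).clamp(min=1e-12)
        scale_b = (b.abs().amax().float() / E4M3_MAX).clamp(min=1e-12)
        a_fp8 = (a / scale_a).to(torch.float8_e4m3fn)
        # torch._scaled_mm requires the second operand in column-major layout.
        b_fp8 = (b / scale_b).to(torch.float8_e4m3fn).t().contiguous().t()
        return torch._scaled_mm(
            a_fp8, b_fp8,
            scale_a=scale_a, scale_b=scale_b,
            out_dtype=torch.bfloat16,
            use_fast_accum=True,  # faster, lower-precision accumulation
        )

    x = torch.randn(1024, 4096, device="cuda", dtype=torch.bfloat16)
    w = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
    y = fp8_matmul(x, w)  # bfloat16 result computed on FP8 tensor cores

The use_fast_accum flag mirrors the "faster half-precision accumulation" idea mentioned above; a production implementation would also quantize weights ahead of time rather than per call.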

Quick Start & Requirements

  • Installation: Requires PyTorch with CUDA 12.4. Install via Conda/Mamba or pip.
    • Conda/Mamba: mamba create -n flux-fp8-matmul-api python=3.11 pytorch torchvision torchaudio pytorch-cuda=12.4 -c pytorch -c nvidia
    • Pip: python -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
    • Then: python -m pip install -r requirements.txt
  • Prerequisites: NVIDIA GPU with FP8 support and at least 16GB VRAM (e.g., RTX 40-series Ada). CUDA 12.4. Python 3.11.
  • Setup: Requires downloading original BFL checkpoints for the flow transformer, autoencoder, and text encoder. Configuration involves specifying paths to these checkpoints in JSON files.
  • Running the API: python main.py --config-path <path_to_config> --port <port_number> (an example client request follows this list)
  • Docs: README
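
Once the server is up, generation is a plain HTTP call. The sketch below is a hypothetical client: the /generate route, port, and JSON field names are assumptions based on this summary, so check the README and main.py for the actual schema.

    import requests

    # Hypothetical request -- endpoint path and field names are placeholders;
    # consult the repo's README for the real API schema.
    resp = requests.post(
        "http://localhost:8088/generate",
        json={
            "prompt": "a misty forest at dawn, volumetric light",
            "width": 1024,
            "height": 1024,
            "num_steps": 24,
            "guidance": 3.5,
        },
        timeout=300,
    )
    resp.raise_for_status()
    with open("output.jpg", "wb") as f:
        f.write(resp.content)  # server is assumed to return raw image bytes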

Highlighted Details

  • Achieves up to 2.8x speedup on RTX 6000 Ada and 3.51x on RTX 4090 with compiled blocks and extras enabled, compared to baseline FP8.
  • Supports LoRA loading via the API and the pipeline object (see the sketch after this list).
  • Offers configurable quantization levels for flow transformer modulation and embedder layers.
  • Allows specifying custom CLIP model paths.
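
A hypothetical sketch of the pipeline-object route for LoRA loading is below; the module, class, and method names are assumptions for illustration rather than the repo's confirmed API.

    # Hypothetical usage -- names below are assumptions, not confirmed API;
    # consult the README for the real entry points.
    from flux_pipeline import FluxPipeline  # assumed module/class name

    pipe = FluxPipeline.load_pipeline_from_config_path("configs/config-dev.json")
    pipe.load_lora("loras/my-style.safetensors", scale=1.0)  # assumed signature
    image_bytes = pipe.generate(prompt="a watercolor fox in a snowy field")
    with open("fox.jpg", "wb") as f:
        f.write(image_bytes)  # assumed to return encoded image bytes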

Maintenance & Community

  • Last updated: October 3, 2024.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README; check the repository itself for a LICENSE file before commercial use.

Limitations & Caveats

  • Requires specific NVIDIA hardware (Ada generation GPUs) with FP8 support and CUDA 12.4.
  • Pre-quantized flow models are tied to specific quantization settings used during their creation.
  • Using bfloat16 for flow_dtype is recommended for quality but may be slightly slower on consumer GPUs (see the config sketch below).
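
For context on that last point, here is a hypothetical config excerpt, written as a Python dict mirroring the JSON config files: only flow_dtype is named in this summary, and the other keys are illustrative placeholders for the checkpoint paths described under Setup.

    import json

    # Hypothetical config -- "flow_dtype" appears in this summary; the other
    # key names are placeholders, so mirror the repo's example configs instead.
    config = {
        "flow_model_path": "/models/flux1-dev.safetensors",  # BFL flow transformer
        "ae_path": "/models/ae.safetensors",                 # autoencoder
        "text_enc_path": "/models/t5-v1_1-xxl",              # text encoder
        "flow_dtype": "bfloat16",  # better quality; can be slightly slower on consumer GPUs
    }
    with open("configs/my-config.json", "w") as f:
        json.dump(config, f, indent=2)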

Health Check

  • Last commit: 9 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 11 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jaret Burkett (Founder of Ostris), and 1 more.

nunchaku by nunchaku-tech
2.1% · 3k stars
High-performance 4-bit diffusion model inference engine
created 8 months ago · updated 19 hours ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Ying Sheng (Author of SGLang).

fastllm by ztxz16
0.4% · 4k stars
High-performance C++ LLM inference library
created 2 years ago · updated 2 weeks ago
Starred by Nat Friedman (Former CEO of GitHub), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 6 more.

FasterTransformer by NVIDIA
0.2% · 6k stars
Optimized transformer library for inference
created 4 years ago · updated 1 year ago