flux-fp8-api by aredden

FastAPI for text-to-image diffusion using FP8

created 1 year ago
272 stars

Top 95.5% on sourcepulse

Project Summary

This repository provides a FastAPI implementation of the Flux diffusion model, optimized for speed through FP8 matrix multiplication and other quantization techniques. It targets users seeking faster image generation on consumer hardware, offering a ~2x speedup over baseline implementations.

How It Works

The core speedup comes from running the Flux model's matrix multiplications in FP8 precision, which significantly accelerates computation on supported GPUs. Specific model blocks, and optionally additional layers ("extras"), can be compiled for further gains, while the remaining layers use faster half-precision accumulation.
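
As a rough illustration of the technique (not this repository's actual code), the sketch below quantizes both operands of a matmul to float8_e4m3fn with per-tensor scales and multiplies them with PyTorch's scaled-matmul primitive. It assumes PyTorch >= 2.4 on an FP8-capable GPU; torch._scaled_mm is a private API whose signature has changed between releases, so treat this strictly as a sketch under those assumptions.

    import torch

    E4M3_MAX = 448.0  # largest magnitude representable in float8_e4m3fn

    def fp8_matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        """Compute a @ b (a: [M, K], b: [K, N]) in FP8 with per-tensor scales."""
        scale_a = (a.abs().amax().float() / E4M3_MAX).clamp(min=1e-12)
        scale_b = (b.abs().amax().float() / E4M3_MAX).clamp(min=1e-12)
        a_fp8 = (a / scale_a).to(torch.float8_e4m3fn)
        # torch._scaled_mm requires the second operand in column-major layout.
        b_fp8 = (b / scale_b).to(torch.float8_e4m3fn).t().contiguous().t()
        return torch._scaled_mm(
            a_fp8, b_fp8,
            scale_a=scale_a, scale_b=scale_b,
            out_dtype=torch.bfloat16,
            use_fast_accum=True,  # faster, lower-precision accumulation
        )

    x = torch.randn(1024, 4096, device="cuda", dtype=torch.bfloat16)
    w = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
    y = fp8_matmul(x, w)  # bfloat16 result computed on FP8 tensor cores

The use_fast_accum flag mirrors the "faster half-precision accumulation" idea mentioned above; a production implementation would also quantize weights ahead of time rather than per call.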

Quick Start & Requirements

  • Installation: Requires PyTorch with CUDA 12.4. Install via Conda/Mamba or pip.
    • Conda/Mamba: mamba create -n flux-fp8-matmul-api python=3.11 pytorch torchvision torchaudio pytorch-cuda=12.4 -c pytorch -c nvidia
    • Pip: python -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
    • Then: python -m pip install -r requirements.txt
  • Prerequisites: NVIDIA GPU with FP8 support and at least 16GB VRAM (e.g., RTX 40-series Ada). CUDA 12.4. Python 3.11.
  • Setup: Requires downloading original BFL checkpoints for the flow transformer, autoencoder, and text encoder. Configuration involves specifying paths to these checkpoints in JSON files.
  • Running the API: python main.py --config-path <path_to_config> --port <port_number> (an example client request follows this list)
  • Docs: README
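
Once the server is up, generation is a plain HTTP call. The sketch below is a hypothetical client: the /generate route, port, and JSON field names are assumptions based on this summary, so check the README and main.py for the actual schema.

    import requests

    # Hypothetical request -- endpoint path and field names are placeholders;
    # consult the repo's README for the real API schema.
    resp = requests.post(
        "http://localhost:8088/generate",
        json={
            "prompt": "a misty forest at dawn, volumetric light",
            "width": 1024,
            "height": 1024,
            "num_steps": 24,
            "guidance": 3.5,
        },
        timeout=300,
    )
    resp.raise_for_status()
    with open("output.jpg", "wb") as f:
        f.write(resp.content)  # server is assumed to return raw image bytes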

Highlighted Details

  • Achieves up to 2.8x speedup on RTX 6000 Ada and 3.51x on RTX 4090 with compiled blocks and extras enabled, compared to baseline FP8.
  • Supports LoRA loading via the API and the pipeline object (see the sketch after this list).
  • Offers configurable quantization levels for flow transformer modulation and embedder layers.
  • Allows specifying custom CLIP model paths.
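
A hypothetical sketch of the pipeline-object route for LoRA loading is below; the module, class, and method names are assumptions for illustration rather than the repo's confirmed API.

    # Hypothetical usage -- names below are assumptions, not confirmed API;
    # consult the README for the real entry points.
    from flux_pipeline import FluxPipeline  # assumed module/class name

    pipe = FluxPipeline.load_pipeline_from_config_path("configs/config-dev.json")
    pipe.load_lora("loras/my-style.safetensors", scale=1.0)  # assumed signature
    image_bytes = pipe.generate(prompt="a watercolor fox in a snowy field")
    with open("fox.jpg", "wb") as f:
        f.write(image_bytes)  # assumed to return encoded image bytes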

Maintenance & Community

  • Last updated: October 3, 2024.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README; check the repository itself for a LICENSE file before commercial use.

Limitations & Caveats

  • Requires specific NVIDIA hardware (Ada generation GPUs) with FP8 support and CUDA 12.4.
  • Pre-quantized flow models are tied to specific quantization settings used during their creation.
  • Using bfloat16 for flow_dtype is recommended for quality but may be slightly slower on consumer GPUs (see the config sketch below).
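
For context on that last point, here is a hypothetical config excerpt, written as a Python dict mirroring the JSON config files: only flow_dtype is named in this summary, and the other keys are illustrative placeholders for the checkpoint paths described under Setup.

    import json

    # Hypothetical config -- "flow_dtype" appears in this summary; the other
    # key names are placeholders, so mirror the repo's example configs instead.
    config = {
        "flow_model_path": "/models/flux1-dev.safetensors",  # BFL flow transformer
        "ae_path": "/models/ae.safetensors",                 # autoencoder
        "text_enc_path": "/models/t5-v1_1-xxl",              # text encoder
        "flow_dtype": "bfloat16",  # better quality; can be slightly slower on consumer GPUs
    }
    with open("configs/my-config.json", "w") as f:
        json.dump(config, f, indent=2)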

Health Check

  • Last commit: 9 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 11 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jaret Burkett (Founder of Ostris), and 1 more.

nunchaku by nunchaku-tech
2.1% · 3k stars
High-performance 4-bit diffusion model inference engine
created 8 months ago · updated 19 hours ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Ying Sheng (Author of SGLang).

fastllm by ztxz16
0.4% · 4k stars
High-performance C++ LLM inference library
created 2 years ago · updated 2 weeks ago
Starred by Nat Friedman (Former CEO of GitHub), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 6 more.

FasterTransformer by NVIDIA
0.2% · 6k stars
Optimized transformer library for inference
created 4 years ago · updated 1 year ago