nunchaku by nunchaku-tech

High-performance 4-bit diffusion model inference engine

Created 8 months ago · 2,508 stars · Top 19.1% on sourcepulse

Project Summary

Nunchaku is a high-performance inference engine for 4-bit neural networks, specifically diffusion models, built to overcome the memory and speed limits of full-precision inference. It targets researchers and power users working with large generative models, offering significant speedups and reduced memory footprints for tasks such as image generation.

How It Works

Nunchaku implements SVDQuant, a post-training quantization technique that tames outliers by migrating them from the activations into the weights and then absorbing them into a low-rank component of the weights. This low-rank component is kept at 16-bit precision, while the residual is quantized to 4-bit, which eases quantization difficulty and enables substantial memory reduction and speedups without significant quality degradation. The engine fuses the low-rank and low-bit kernels to minimize the latency overhead the extra branch would otherwise add.
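
To make the decomposition concrete, here is a minimal PyTorch sketch of the idea (an illustration only, not the library's fused kernels; the rank, the per-tensor scale, and the function names are assumptions for exposition, and the real method uses finer-grained scales and also quantizes activations):

  import torch

  def svdquant_decompose(W: torch.Tensor, rank: int = 32, n_bits: int = 4):
      # Split W into a 16-bit low-rank branch plus a 4-bit residual:
      # W ≈ L1 @ L2 + dequant(Rq). The truncated SVD absorbs the
      # outlier-heavy directions, leaving a residual that quantizes well.
      U, S, Vh = torch.linalg.svd(W.float(), full_matrices=False)
      L1 = (U[:, :rank] * S[:rank]).half()      # (out, rank), fp16
      L2 = Vh[:rank, :].half()                  # (rank, in), fp16
      R = W.float() - L1.float() @ L2.float()   # residual to quantize
      qmax = 2 ** (n_bits - 1) - 1              # 7 for 4-bit
      scale = R.abs().max() / qmax              # illustrative per-tensor scale
      Rq = (R / scale).round().clamp(-qmax - 1, qmax)
      return L1, L2, Rq, scale

  def svdquant_linear(x, L1, L2, Rq, scale):
      # Forward pass: 16-bit low-rank path + dequantized 4-bit residual path.
      low_rank = (x.half() @ L2.t()) @ L1.t()
      residual = x @ (Rq * scale).t()
      return low_rank.float() + residual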

Quick Start & Requirements

  • Install: pip install <your-wheel-file>.whl (pre-built wheels are available; see the usage sketch after this list).
  • Prerequisites: PyTorch >= 2.5. For building from source: CUDA >= 12.2 (Linux) / 12.6 (Windows), GCC >= 11 (Linux), latest MSVC (Windows). Blackwell GPUs require CUDA 12.8+.
  • Compatibility: Supports NVIDIA GPUs with architectures sm_75 (Turing), sm_80 (A100), sm_86 (Ampere), and sm_89 (Ada).
  • Resources: Memory requirements can be as low as 4 GiB with CPU offloading.
  • Docs: Paper, Website, Demo
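
A hedged end-to-end sketch of typical usage through diffusers, following the project's examples (the NunchakuFluxTransformer2dModel class and the quantized-model repo ID are taken from the README at the time of this summary and may have changed):

  import torch
  from diffusers import FluxPipeline
  from nunchaku import NunchakuFluxTransformer2dModel

  # Load the SVDQuant 4-bit FLUX.1 transformer (repo ID is an assumption).
  transformer = NunchakuFluxTransformer2dModel.from_pretrained(
      "mit-han-lab/svdq-int4-flux.1-dev"
  )
  pipeline = FluxPipeline.from_pretrained(
      "black-forest-labs/FLUX.1-dev",
      transformer=transformer,
      torch_dtype=torch.bfloat16,
  ).to("cuda")

  image = pipeline("A cat holding a sign that says hello world").images[0]
  image.save("flux-int4.png")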

Highlighted Details

  • Achieves 3.6x memory reduction and 3x speedup over BF16 on FLUX.1 models.
  • Offers custom FP16 attention for up to 1.2x faster performance on NVIDIA 30/40/50 series.
  • Supports First-Block Cache for up to 2x speedup in long-step denoising.
  • Seamlessly integrates customized LoRAs and ControlNets.
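
For the LoRA integration, a minimal hedged sketch continuing from the pipeline above (the update_lora_params and set_lora_strength methods follow the project's examples and may differ across versions; the file path is hypothetical):

  # Attach a LoRA to the 4-bit transformer loaded in the earlier sketch.
  transformer.update_lora_params("path/to/your_lora.safetensors")  # hypothetical path
  transformer.set_lora_strength(0.8)  # scale applied to the LoRA branch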

Maintenance & Community

  • Active development with frequent releases (v0.2.0 as of April 2025).
  • Community support via Slack, Discord, and WeChat.
  • Roadmap and FAQ available.

Licensing & Compatibility

  • The README does not explicitly state a license for the project, and the underlying quantization library, DeepCompressor, is likewise not explicitly licensed in its linked repository. Suitability for commercial use is therefore unspecified.

Limitations & Caveats

  • Building from source requires specific CUDA and compiler versions, and can be more involved on Windows.
  • The README states that NVFP4 precision targets Blackwell (50-series) GPUs and requires PyTorch 2.7+, while other sections mention FP16 attention support for 20-series GPUs; full compatibility across all supported architectures and precisions could be documented more clearly.
Health Check

  • Last commit: 14 hours ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 27
  • Issues (30d): 91
  • Star history: 942 stars in the last 90 days
