nunchaku by nunchaku-tech

High-performance 4-bit diffusion model inference engine

Created 8 months ago · 2,508 stars · Top 19.1% on sourcepulse

Project Summary

Nunchaku is a high-performance inference engine for 4-bit neural networks, specifically diffusion models, built to overcome the memory and speed limits of full-precision inference. It targets researchers and power users working with large generative models, offering significant speedups and reduced memory footprints for tasks such as image generation.

How It Works

Nunchaku implements SVDQuant, a post-training quantization technique that tames outliers by migrating them from the activations into the weights and then absorbing them into a low-rank component of the weights. This low-rank component is kept at 16-bit precision, while the residual is quantized to 4-bit, which eases quantization difficulty and enables substantial memory reduction and speedups without significant quality degradation. The engine fuses the low-rank and low-bit kernels to minimize the latency overhead the extra branch would otherwise add.
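
To make the decomposition concrete, here is a minimal PyTorch sketch of the idea (an illustration only, not the library's fused kernels; the rank, the per-tensor scale, and the function names are assumptions for exposition, and the real method uses finer-grained scales and also quantizes activations):

  import torch

  def svdquant_decompose(W: torch.Tensor, rank: int = 32, n_bits: int = 4):
      # Split W into a 16-bit low-rank branch plus a 4-bit residual:
      # W ≈ L1 @ L2 + dequant(Rq). The truncated SVD absorbs the
      # outlier-heavy directions, leaving a residual that quantizes well.
      U, S, Vh = torch.linalg.svd(W.float(), full_matrices=False)
      L1 = (U[:, :rank] * S[:rank]).half()      # (out, rank), fp16
      L2 = Vh[:rank, :].half()                  # (rank, in), fp16
      R = W.float() - L1.float() @ L2.float()   # residual to quantize
      qmax = 2 ** (n_bits - 1) - 1              # 7 for 4-bit
      scale = R.abs().max() / qmax              # illustrative per-tensor scale
      Rq = (R / scale).round().clamp(-qmax - 1, qmax)
      return L1, L2, Rq, scale

  def svdquant_linear(x, L1, L2, Rq, scale):
      # Forward pass: 16-bit low-rank path + dequantized 4-bit residual path.
      low_rank = (x.half() @ L2.t()) @ L1.t()
      residual = x @ (Rq * scale).t()
      return low_rank.float() + residual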

Quick Start & Requirements

  • Install: pip install <your-wheel-file>.whl (pre-built wheels are available; see the usage sketch after this list).
  • Prerequisites: PyTorch >= 2.5. For building from source: CUDA >= 12.2 (Linux) / 12.6 (Windows), GCC >= 11 (Linux), latest MSVC (Windows). Blackwell GPUs require CUDA 12.8+.
  • Compatibility: Supports NVIDIA GPUs with architectures sm_75 (Turing), sm_80 (A100), sm_86 (Ampere), and sm_89 (Ada).
  • Resources: Memory requirements can be as low as 4 GiB with CPU offloading.
  • Docs: Paper, Website, Demo
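
A hedged end-to-end sketch of typical usage through diffusers, following the project's examples (the NunchakuFluxTransformer2dModel class and the quantized-model repo ID are taken from the README at the time of this summary and may have changed):

  import torch
  from diffusers import FluxPipeline
  from nunchaku import NunchakuFluxTransformer2dModel

  # Load the SVDQuant 4-bit FLUX.1 transformer (repo ID is an assumption).
  transformer = NunchakuFluxTransformer2dModel.from_pretrained(
      "mit-han-lab/svdq-int4-flux.1-dev"
  )
  pipeline = FluxPipeline.from_pretrained(
      "black-forest-labs/FLUX.1-dev",
      transformer=transformer,
      torch_dtype=torch.bfloat16,
  ).to("cuda")

  image = pipeline("A cat holding a sign that says hello world").images[0]
  image.save("flux-int4.png")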

Highlighted Details

  • Achieves 3.6x memory reduction and 3x speedup over BF16 on FLUX.1 models.
  • Offers custom FP16 attention for up to 1.2x faster performance on NVIDIA 30/40/50 series.
  • Supports First-Block Cache for up to 2x speedup in long-step denoising.
  • Seamlessly integrates customized LoRAs and ControlNets.
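
For the LoRA integration, a minimal hedged sketch continuing from the pipeline above (the update_lora_params and set_lora_strength methods follow the project's examples and may differ across versions; the file path is hypothetical):

  # Attach a LoRA to the 4-bit transformer loaded in the earlier sketch.
  transformer.update_lora_params("path/to/your_lora.safetensors")  # hypothetical path
  transformer.set_lora_strength(0.8)  # scale applied to the LoRA branch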

Maintenance & Community

  • Active development with frequent releases (v0.2.0 as of April 2025).
  • Community support via Slack, Discord, and WeChat.
  • Roadmap and FAQ available.

Licensing & Compatibility

  • The README does not explicitly state a license for the project, and the underlying quantization library, DeepCompressor, is likewise not explicitly licensed in its linked repository. Suitability for commercial use is therefore unspecified.

Limitations & Caveats

  • Building from source requires specific CUDA and compiler versions, and can be more involved on Windows.
  • The README states that NVFP4 precision targets Blackwell (50-series) GPUs and requires PyTorch 2.7+, while other sections mention FP16 attention support for 20-series GPUs; full compatibility across all supported architectures and precisions could be documented more clearly.
Health Check

  • Last commit: 14 hours ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 27
  • Issues (30d): 91
  • Star history: 942 stars in the last 90 days
