Nunchaku is a high-performance inference engine optimized for 4-bit neural networks, specifically diffusion models, addressing memory and speed limitations. It targets researchers and power users working with large generative models, offering significant speedups and reduced memory footprints for tasks like image generation.
How It Works
Nunchaku implements SVDQuant, a post-training quantization technique that handles outliers in activations and weights by migrating them into a low-rank component of the weights. This low-rank branch is kept at 16-bit precision, while the residual is quantized to 4 bits. Absorbing the outliers this way eases quantization, enabling substantial memory reduction and speed improvements without significant quality degradation. To keep the extra low-rank branch from adding latency, the engine fuses its kernels into the surrounding computation.
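The core idea can be illustrated with a toy numerical sketch: a weight matrix with a strong outlier direction quantizes poorly at 4 bits, but after peeling off a low-rank component, the residual quantizes cleanly. This is an illustrative sketch only; the per-tensor symmetric int4 quantizer and all names below are assumptions, not Nunchaku's actual implementation.

```python
import numpy as np

def quantize_int4(w):
    """Symmetric per-tensor 4-bit quantization: integers in [-7, 7] times a scale."""
    scale = np.abs(w).max() / 7.0
    q = np.round(w / scale).clip(-7, 7)
    return q, scale

rng = np.random.default_rng(0)
# Weight matrix: smooth bulk plus a strong rank-1 "outlier" direction.
bulk = 0.1 * rng.standard_normal((64, 64))
u, v = rng.standard_normal(64), rng.standard_normal(64)
W = bulk + 10.0 * np.outer(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Naive 4-bit quantization: the outlier inflates the scale, crushing the bulk.
q, s = quantize_int4(W)
err_naive = np.abs(W - q * s).max()

# SVDQuant-style: keep a rank-r component in high (here: full) precision,
# quantize only the now outlier-free residual to 4 bits.
r = 1
U, S, Vt = np.linalg.svd(W, full_matrices=False)
low_rank = (U[:, :r] * S[:r]) @ Vt[:r]        # kept at 16-bit in the real engine
q_res, s_res = quantize_int4(W - low_rank)    # 4-bit residual
err_svd = np.abs(W - (low_rank + q_res * s_res)).max()

print(err_naive, err_svd)  # residual quantization error drops sharply
```

In the real engine the low-rank branch adds extra matrix multiplies, which is why kernel fusion is needed to keep its latency overhead small.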
Quick Start & Requirements
- Install (pre-built wheels are available):
pip install <your-wheel-file>.whl
- Prerequisites: PyTorch >= 2.5. For building from source: CUDA >= 12.2 (Linux) / 12.6 (Windows), GCC >= 11 (Linux), latest MSVC (Windows). Blackwell GPUs require CUDA 12.8+.
- Compatibility: Supports NVIDIA GPUs with compute architectures sm_75 (Turing), sm_80 (Ampere, e.g. A100), sm_86 (Ampere), and sm_89 (Ada).
- Resources: Minimal memory requirement can be as low as 4 GiB with CPU offloading.
- Docs: Paper, Website, Demo
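A small helper of the kind you might use to sanity-check the PyTorch >= 2.5 prerequisite before installing a wheel (illustrative only; this helper is not part of Nunchaku):

```python
def meets_minimum(version: str, minimum=(2, 5)) -> bool:
    """Return True if a 'major.minor[.patch][+local]' version string meets the minimum."""
    parts = []
    # Drop any local suffix like '+cu124', then read leading numeric tokens.
    for token in version.split("+")[0].split("."):
        digits = "".join(ch for ch in token if ch.isdigit())
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts[:2]) >= minimum

# e.g. check torch.__version__ against (2, 5), or (2, 7) for NVFP4 on Blackwell
```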
Highlighted Details
- Achieves 3.6x memory reduction and 3x speedup over BF16 on FLUX.1 models.
- Offers custom FP16 attention for up to 1.2x faster performance on NVIDIA 30/40/50 series.
- Supports First-Block Cache for up to 2x speedup in long-step denoising.
- Seamlessly integrates customized LoRAs and ControlNets.
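The First-Block Cache idea above can be sketched generically: run the first transformer block every denoising step, and when its output has barely moved relative to the cached value, reuse the cached output of the remaining (expensive) blocks instead of recomputing them. A minimal toy sketch, with stand-in functions rather than Nunchaku's actual implementation:

```python
def first_block(x):
    return [0.5 * v for v in x]          # stand-in for transformer block 1

def remaining_blocks(h):
    return [v * v + 1.0 for v in h]      # stand-in for blocks 2..N (expensive)

def rel_change(a, b):
    """Mean absolute change of a relative to the magnitude of b."""
    num = sum(abs(x - y) for x, y in zip(a, b))
    den = sum(abs(y) for y in b) or 1.0
    return num / den

def denoise(steps, threshold=0.05):
    x = [1.0, 2.0, 3.0]                  # stand-in for the latent
    cached_h1 = cached_out = None
    skipped = 0
    for step in steps:
        x = [v + step for v in x]        # stand-in for the denoising update
        h1 = first_block(x)
        if cached_h1 is not None and rel_change(h1, cached_h1) < threshold:
            out = cached_out             # input barely moved: reuse the cache
            skipped += 1
        else:
            out = remaining_blocks(h1)   # recompute and refresh the cache
            cached_h1, cached_out = h1, out
    return out, skipped
```

The threshold trades quality for speed: a looser threshold skips more steps, which is where the reported up-to-2x speedup in long-step denoising comes from.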
Maintenance & Community
- Active development with frequent releases (v0.2.0 as of April 2025).
- Community support via Slack, Discord, and WeChat.
- Roadmap and FAQ available.
Licensing & Compatibility
- Neither the project's README nor the linked repository of its underlying quantization library, DeepCompressor, explicitly states a license. Suitability for commercial use is therefore unspecified.
Limitations & Caveats
- Building from source requires specific CUDA and compiler versions, with potential complexities for Windows users.
- The README states that NVFP4 precision targets Blackwell (50-series) GPUs and requires PyTorch 2.7+, while other sections mention FP16 attention support for 20-series GPUs; the full compatibility matrix across architectures and precisions is not clearly documented.