ComfyUI-nunchaku by nunchaku-tech

ComfyUI plugin for efficient 4-bit neural network inference

Created 6 months ago
2,188 stars

Top 20.7% on SourcePulse

View on GitHub
Project Summary

This repository provides ComfyUI nodes for Nunchaku, an efficient inference engine for 4-bit neural networks quantized with SVDQuant. It targets users of ComfyUI looking to leverage highly optimized, memory-efficient diffusion models, offering significant speedups and reduced VRAM requirements.

How It Works

Nunchaku uses SVDQuant for 4-bit quantization, enabling efficient inference on consumer hardware. The ComfyUI nodes integrate this engine, providing specialized loaders for diffusion models, LoRAs, and text encoders. Key advantages include a custom FP16 attention implementation that outperforms FlashAttention-2 on compatible hardware, and a First-Block Cache mechanism that further accelerates inference.
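
To make the First-Block Cache idea concrete, here is a minimal conceptual sketch in Python. It is not Nunchaku's implementation, and the class and parameter names are illustrative: at each denoising step, only the first transformer block is run, its output is compared with the previous step's, and if the relative change falls below a threshold the cached contribution of the remaining blocks is reused instead of recomputing them.

```python
import torch


class FirstBlockCache:
    """Conceptual first-block cache (illustrative, not Nunchaku's code)."""

    def __init__(self, blocks, threshold=0.1):
        # blocks: a list of callables (transformer blocks) mapping tensor -> tensor
        self.first_block, self.rest = blocks[0], blocks[1:]
        self.threshold = threshold
        self.prev_first_out = None   # first-block output from the previous step
        self.cached_residual = None  # cached contribution of the remaining blocks

    def forward(self, hidden_states):
        first_out = self.first_block(hidden_states)

        if self.prev_first_out is not None and self.cached_residual is not None:
            rel_change = (first_out - self.prev_first_out).norm() / (
                self.prev_first_out.norm() + 1e-8
            )
            if rel_change < self.threshold:
                # The step changed little: skip the remaining blocks entirely.
                self.prev_first_out = first_out
                return first_out + self.cached_residual

        # Full computation: run the remaining blocks and refresh the cache.
        out = first_out
        for block in self.rest:
            out = block(out)
        self.prev_first_out = first_out
        self.cached_residual = out - first_out
        return out


# Tiny usage example with dummy "blocks" standing in for transformer layers.
blocks = [torch.nn.Linear(16, 16) for _ in range(4)]
cache = FirstBlockCache(blocks, threshold=0.05)
x = torch.randn(2, 16)
for _ in range(5):  # pretend these are denoising steps
    x = cache.forward(x)
```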

Quick Start & Requirements

  • Installation: Install via ComfyUI Manager or manually clone into ComfyUI/custom_nodes.
  • Prerequisites: ComfyUI, Python, and optionally comfy-cli. Requires downloading specific models (e.g., FLUX.1-schnell, text encoders) from HuggingFace/ModelScope (a hedged usage sketch of the underlying engine follows this list).
  • Compatibility: Supports NVIDIA 20-series (Turing) GPUs and newer. FP16 attention is required for 20-series GPUs.
  • Resources: Detailed installation tutorials (video/text) are available.
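
For orientation, the sketch below shows how the underlying nunchaku engine can load a 4-bit FLUX.1-schnell checkpoint through its diffusers integration outside of ComfyUI. The class name NunchakuFluxTransformer2dModel follows nunchaku's documented Python API, but the repository ids and arguments are assumptions that may differ between releases; check the project's documentation for the current names.

```python
# Hedged sketch: running a Nunchaku 4-bit FLUX model via diffusers (outside ComfyUI).
# Repo ids below are assumptions and may not match current release names.
import torch
from diffusers import FluxPipeline
from nunchaku import NunchakuFluxTransformer2dModel

# 4-bit SVDQuant transformer (substitute the repo id your install documents).
transformer = NunchakuFluxTransformer2dModel.from_pretrained(
    "mit-han-lab/svdq-int4-flux.1-schnell"
)

pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipeline(
    "a cat holding a sign that says hello",
    num_inference_steps=4,
    guidance_scale=0.0,
).images[0]
image.save("flux-schnell-4bit.png")
```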

Highlighted Details

  • Nunchaku-FP16 attention is ~1.2x faster than FlashAttention-2 without precision loss.
  • Supports multi-LoRA and ControlNet integration.
  • Includes CPU offloading options for reduced GPU memory usage.
  • LoRA loading does not require pre-conversion.

Maintenance & Community

  • Active development with regular updates and roadmap publications.
  • Community support available via Slack, Discord, and WeChat.

Licensing & Compatibility

  • The README does not explicitly state a license, but the project is associated with MIT HAN Lab, which suggests a permissive license. Compatibility with commercial or closed-source projects is likely, but check the repository's license file before relying on it.

Limitations & Caveats

  • Loading the 4-bit T5 text encoder currently consumes excessive memory; optimizations are planned.
  • The FLUX.1 Depth Preprocessor node is deprecated.

Health Check

  • Last Commit: 4 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 21
  • Issues (30d): 139
  • Star History: 317 stars in the last 30 days

Explore Similar Projects

Starred by Yaowei Zheng (Author of LLaMA-Factory), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 7 more.

llm-awq by mit-han-lab
0.3% · 3k stars
Weight quantization research paper for LLM compression/acceleration
Created 2 years ago · Updated 2 months ago

Starred by Junyang Lin (Core Maintainer at Alibaba Qwen), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 3 more.

neural-compressor by intel
0.2% · 2k stars
Python library for model compression (quantization, pruning, distillation, NAS)
Created 5 years ago · Updated 15 hours ago

Starred by Wing Lian (Founder of Axolotl AI) and Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems").

airllm by lyogavin
0.1% · 6k stars
Inference optimization for LLMs on low-resource hardware
Created 2 years ago · Updated 2 weeks ago

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; Author of CS 231n), Jeff Hammerbacher (Cofounder of Cloudera), and 4 more.

gemma_pytorch by google
0.2% · 6k stars
PyTorch implementation for Google's Gemma models
Created 1 year ago · Updated 3 months ago