ComfyUI-nunchaku by nunchaku-tech

ComfyUI plugin for efficient 4-bit neural network inference

Created 7 months ago
2,431 stars

Top 18.9% on SourcePulse

Project Summary

This repository provides ComfyUI nodes for Nunchaku, an efficient inference engine for 4-bit neural networks quantized with SVDQuant. It targets ComfyUI users who want to run highly optimized, memory-efficient diffusion models, with significant speedups and reduced VRAM requirements.

How It Works

Nunchaku utilizes SVDQuant for 4-bit quantization, enabling efficient inference on consumer hardware. The ComfyUI nodes integrate this engine, providing specialized loaders for diffusion models, LoRAs, and text encoders. Key advantages include a custom FP16 attention implementation that outperforms flash-attention2 on compatible hardware and a First-Block Cache mechanism to further accelerate inference.
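As a rough illustration of how the underlying engine is driven outside ComfyUI, the minimal Python sketch below follows the diffusers-style API described in the Nunchaku documentation. The NunchakuFluxTransformer2dModel loader, the set_attention_impl call, and the model IDs are assumptions taken from that documentation and may differ in your installed version.

```python
# Minimal sketch: run an SVDQuant 4-bit FLUX model through diffusers with Nunchaku.
# Class names, method names, and model IDs follow the Nunchaku docs as recalled here
# and should be verified against the version you have installed.
import torch
from diffusers import FluxPipeline
from nunchaku import NunchakuFluxTransformer2dModel

# Load the 4-bit (SVDQuant) transformer weights.
transformer = NunchakuFluxTransformer2dModel.from_pretrained(
    "mit-han-lab/svdq-int4-flux.1-schnell"
)

# Optional: custom FP16 attention kernel (required on 20-series GPUs,
# roughly 1.2x faster than flash-attention2 on supported hardware).
transformer.set_attention_impl("nunchaku-fp16")

# Drop the quantized transformer into a standard FLUX pipeline.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe(
    "a watercolor fox in a snowy forest",
    num_inference_steps=4,
    guidance_scale=0.0,
).images[0]
image.save("flux-schnell-int4.png")
```

The ComfyUI nodes wrap this same pattern: the model loader node stands in for from_pretrained, and attention, cache, and offload options are exposed as node inputs.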

Quick Start & Requirements

  • Installation: Install via ComfyUI Manager or manually clone into ComfyUI/custom_nodes (see the sketch after this list).
  • Prerequisites: ComfyUI, Python, comfy-cli (optional). Requires downloading specific models (e.g., FLUX.1-schnell, text encoders) from HuggingFace/ModelScope.
  • Compatibility: Supports NVIDIA 20-series (Turing) GPUs and newer. FP16 attention is required for 20-series GPUs.
  • Resources: Detailed installation tutorials (video/text) are available.
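For a manual install, the node pack only needs to end up inside ComfyUI/custom_nodes. A minimal sketch, assuming the repository lives at github.com/nunchaku-tech/ComfyUI-nunchaku (inferred from the project name above):

```python
# Sketch of a manual install: clone the node pack into ComfyUI/custom_nodes.
# The repository URL is inferred from the project name and may differ.
import pathlib
import subprocess

comfyui_root = pathlib.Path("ComfyUI")  # adjust to your ComfyUI checkout
custom_nodes = comfyui_root / "custom_nodes"
custom_nodes.mkdir(parents=True, exist_ok=True)

subprocess.run(
    ["git", "clone", "https://github.com/nunchaku-tech/ComfyUI-nunchaku.git"],
    cwd=custom_nodes,
    check=True,
)
# Restart ComfyUI afterwards so the new nodes are picked up.
```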

Highlighted Details

  • Nunchaku-FP16 attention is ~1.2x faster than flash-attention2 without precision loss.
  • Supports multi-LoRA and ControlNet integration.
  • Includes CPU offloading options for reduced GPU memory usage.
  • LoRA loading does not require pre-conversion (see the sketch after this list).
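As a rough sketch of the last two points when driving the engine directly from Python: the update_lora_params, set_lora_strength, and offload names below are assumptions based on the Nunchaku documentation and should be checked against your installed version; inside ComfyUI the same options appear as node inputs.

```python
# Sketch: LoRA loading and CPU offloading with the Nunchaku engine (outside ComfyUI).
# update_lora_params / set_lora_strength / offload=True are assumed from the
# Nunchaku docs; verify against the version you have installed.
from nunchaku import NunchakuFluxTransformer2dModel

transformer = NunchakuFluxTransformer2dModel.from_pretrained(
    "mit-han-lab/svdq-int4-flux.1-dev",
    offload=True,  # keep most weights in CPU memory to reduce VRAM usage
)

# LoRAs apply directly to the quantized model; no pre-conversion step is needed.
transformer.update_lora_params("path/to/style_lora.safetensors")
transformer.set_lora_strength(0.8)
```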

Maintenance & Community

  • Active development with regular updates and roadmap publications.
  • Community support available via Slack, Discord, and WeChat.

Licensing & Compatibility

  • The README does not explicitly state a license, but the project is associated with the MIT HAN Lab, which suggests a permissive license. Compatibility with commercial or closed-source projects is likely, but verify the license in the repository before use.

Limitations & Caveats

  • Loading the 4-bit T5 model currently consumes excessive memory; optimizations are planned.
  • The FLUX.1 Depth Preprocessor node is deprecated.
Health Check

  • Last Commit: 2 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 24
  • Issues (30d): 53
  • Star History: 137 stars in the last 30 days

Explore Similar Projects

Starred by Yaowei Zheng (Author of LLaMA-Factory), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 7 more.

llm-awq by mit-han-lab

0.2% · 3k stars
Weight quantization research paper for LLM compression/acceleration
Created 2 years ago, updated 3 months ago
Starred by Junyang Lin (Core Maintainer at Alibaba Qwen), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 3 more.

neural-compressor by intel

0.1% · 3k stars
Python library for model compression (quantization, pruning, distillation, NAS)
Created 5 years ago, updated 9 hours ago
Starred by Wing Lian (Founder of Axolotl AI) and Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems").

airllm by lyogavin

0.7% · 6k stars
Inference optimization for LLMs on low-resource hardware
Created 2 years ago, updated 2 months ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Jeff Hammerbacher (Cofounder of Cloudera), and 4 more.

gemma_pytorch by google

0.1% · 6k stars
PyTorch implementation for Google's Gemma models
Created 1 year ago, updated 5 months ago