ComfyUI-nunchaku by nunchaku-tech

ComfyUI plugin for efficient 4-bit neural network inference

Created 6 months ago
2,188 stars

Top 20.7% on SourcePulse

View on GitHub
Project Summary

This repository provides ComfyUI nodes for Nunchaku, an efficient inference engine for 4-bit neural networks quantized with SVDQuant. It targets users of ComfyUI looking to leverage highly optimized, memory-efficient diffusion models, offering significant speedups and reduced VRAM requirements.

How It Works

Nunchaku uses SVDQuant for 4-bit quantization, enabling efficient inference on consumer hardware. The ComfyUI nodes integrate this engine, providing specialized loaders for diffusion models, LoRAs, and text encoders. Key advantages include a custom FP16 attention implementation that outperforms FlashAttention-2 on compatible hardware, and a First-Block Cache mechanism that further accelerates inference.
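
To make the First-Block Cache idea concrete, here is a minimal conceptual sketch in Python. It is not Nunchaku's implementation, and the class and parameter names are illustrative: at each denoising step, only the first transformer block is run, its output is compared with the previous step's, and if the relative change falls below a threshold the cached contribution of the remaining blocks is reused instead of recomputing them.

```python
import torch


class FirstBlockCache:
    """Conceptual first-block cache (illustrative, not Nunchaku's code)."""

    def __init__(self, blocks, threshold=0.1):
        # blocks: a list of callables (transformer blocks) mapping tensor -> tensor
        self.first_block, self.rest = blocks[0], blocks[1:]
        self.threshold = threshold
        self.prev_first_out = None   # first-block output from the previous step
        self.cached_residual = None  # cached contribution of the remaining blocks

    def forward(self, hidden_states):
        first_out = self.first_block(hidden_states)

        if self.prev_first_out is not None and self.cached_residual is not None:
            rel_change = (first_out - self.prev_first_out).norm() / (
                self.prev_first_out.norm() + 1e-8
            )
            if rel_change < self.threshold:
                # The step changed little: skip the remaining blocks entirely.
                self.prev_first_out = first_out
                return first_out + self.cached_residual

        # Full computation: run the remaining blocks and refresh the cache.
        out = first_out
        for block in self.rest:
            out = block(out)
        self.prev_first_out = first_out
        self.cached_residual = out - first_out
        return out


# Tiny usage example with dummy "blocks" standing in for transformer layers.
blocks = [torch.nn.Linear(16, 16) for _ in range(4)]
cache = FirstBlockCache(blocks, threshold=0.05)
x = torch.randn(2, 16)
for _ in range(5):  # pretend these are denoising steps
    x = cache.forward(x)
```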

Quick Start & Requirements

  • Installation: Install via ComfyUI Manager or manually clone into ComfyUI/custom_nodes.
  • Prerequisites: ComfyUI, Python, and optionally comfy-cli. Requires downloading specific models (e.g., FLUX.1-schnell, text encoders) from HuggingFace/ModelScope (a hedged usage sketch of the underlying engine follows this list).
  • Compatibility: Supports NVIDIA 20-series (Turing) GPUs and newer. FP16 attention is required for 20-series GPUs.
  • Resources: Detailed installation tutorials (video/text) are available.
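
For orientation, the sketch below shows how the underlying nunchaku engine can load a 4-bit FLUX.1-schnell checkpoint through its diffusers integration outside of ComfyUI. The class name NunchakuFluxTransformer2dModel follows nunchaku's documented Python API, but the repository ids and arguments are assumptions that may differ between releases; check the project's documentation for the current names.

```python
# Hedged sketch: running a Nunchaku 4-bit FLUX model via diffusers (outside ComfyUI).
# Repo ids below are assumptions and may not match current release names.
import torch
from diffusers import FluxPipeline
from nunchaku import NunchakuFluxTransformer2dModel

# 4-bit SVDQuant transformer (substitute the repo id your install documents).
transformer = NunchakuFluxTransformer2dModel.from_pretrained(
    "mit-han-lab/svdq-int4-flux.1-schnell"
)

pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipeline(
    "a cat holding a sign that says hello",
    num_inference_steps=4,
    guidance_scale=0.0,
).images[0]
image.save("flux-schnell-4bit.png")
```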

Highlighted Details

  • Nunchaku-FP16 attention is ~1.2x faster than FlashAttention-2 without precision loss.
  • Supports multi-LoRA and ControlNet integration.
  • Includes CPU offloading options for reduced GPU memory usage.
  • LoRA loading does not require pre-conversion.

Maintenance & Community

  • Active development with regular updates and roadmap publications.
  • Community support available via Slack, Discord, and WeChat.

Licensing & Compatibility

  • The README does not explicitly state a license, but the project is associated with MIT HAN Lab, which suggests a permissive license. Compatibility with commercial or closed-source projects is likely, but check the repository's license file before relying on it.

Limitations & Caveats

  • Loading the 4-bit T5 text encoder currently consumes excessive memory; optimizations are planned.
  • The FLUX.1 Depth Preprocessor node is deprecated.

Health Check

  • Last Commit: 4 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 21
  • Issues (30d): 139
  • Star History: 317 stars in the last 30 days

Explore Similar Projects

Starred by Yaowei Zheng (Author of LLaMA-Factory), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 7 more.

llm-awq by mit-han-lab
0.3% · 3k stars
Weight quantization research paper for LLM compression/acceleration
Created 2 years ago · Updated 2 months ago

Starred by Junyang Lin (Core Maintainer at Alibaba Qwen), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 3 more.

neural-compressor by intel
0.2% · 2k stars
Python library for model compression (quantization, pruning, distillation, NAS)
Created 5 years ago · Updated 15 hours ago

Starred by Wing Lian (Founder of Axolotl AI) and Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems").

airllm by lyogavin
0.1% · 6k stars
Inference optimization for LLMs on low-resource hardware
Created 2 years ago · Updated 2 weeks ago

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; Author of CS 231n), Jeff Hammerbacher (Cofounder of Cloudera), and 4 more.

gemma_pytorch by google
0.2% · 6k stars
PyTorch implementation for Google's Gemma models
Created 1 year ago · Updated 3 months ago