deepcompressor by nunchaku-tech

Model compression toolbox for LLMs and diffusion models

created 1 year ago
565 stars

Top 57.8% on sourcepulse

View on GitHub
1 Expert Loves This Project
Project Summary

DeepCompressor is a PyTorch-based toolbox for compressing Large Language Models (LLMs) and Diffusion Models, targeting researchers and engineers aiming to deploy these models efficiently. It offers advanced quantization techniques, including 4-bit and 8-bit precision for weights and activations, significantly reducing memory footprint and latency while preserving model accuracy.
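To make the precision labels concrete, here is a minimal PyTorch sketch of symmetric uniform quantization, the basic operation behind notations like W4 (4-bit weights) and A8 (8-bit activations). It is illustrative only, not DeepCompressor's API; production schemes typically use per-group or per-channel scales rather than the per-tensor scale shown here.

```python
import torch

def quantize_symmetric(w: torch.Tensor, bits: int = 4):
    """Symmetric uniform quantization to signed `bits`-bit integers (illustrative)."""
    qmax = 2 ** (bits - 1) - 1               # 7 for 4-bit, 127 for 8-bit
    scale = w.abs().max() / qmax             # per-tensor scale; real schemes use finer granularity
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale           # int8 used as a container for 4-bit values

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(128, 128)
q, s = quantize_symmetric(w, bits=4)         # "W4": weights stored in 4 bits
print((w - dequantize(q, s)).abs().mean())   # mean quantization error
```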

How It Works

The toolbox implements state-of-the-art quantization algorithms such as AWQ, GPTQ, and SmoothQuant, alongside two novel contributions: QoQ (W4A8KV4 for LLMs) and SVDQuant (W4A4 for diffusion models). QoQ reduces the overheads of low-bit LLM serving by optimizing dequantization and KV cache handling. SVDQuant makes aggressive 4-bit quantization viable for diffusion models by absorbing outliers into a low-rank component, while a fused inference engine (Nunchaku) keeps the extra low-rank branch efficient at runtime.
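The low-rank idea is easiest to see in code. Below is a conceptual PyTorch sketch of an SVDQuant-style decomposition under simplifying assumptions (per-tensor scale, no activation-side smoothing): a rank-r branch absorbs the dominant, outlier-heavy components so the residual can survive 4-bit quantization. The function name and the SVD-based split are illustrative, not the repository's implementation.

```python
import torch

def svdquant_sketch(w: torch.Tensor, rank: int = 32, bits: int = 4) -> torch.Tensor:
    """Low-rank branch + 4-bit residual: the core idea behind SVDQuant (illustrative)."""
    # A rank-`rank` branch (kept in 16-bit at inference) absorbs the
    # dominant directions of the weight matrix.
    u, s, vh = torch.linalg.svd(w, full_matrices=False)
    l1 = u[:, :rank] * s[:rank]     # (out_features, rank)
    l2 = vh[:rank, :]               # (rank, in_features)

    # The flattened residual is far easier to quantize to 4 bits.
    residual = w - l1 @ l2
    qmax = 2 ** (bits - 1) - 1
    scale = residual.abs().max() / qmax   # per-tensor scale for simplicity
    q = torch.clamp(torch.round(residual / scale), -qmax - 1, qmax)

    # Effective weight seen at inference: low-rank branch + dequantized residual.
    return l1 @ l2 + q * scale

w = torch.randn(512, 512)
print((w - svdquant_sketch(w)).abs().mean())   # reconstruction error
```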

Quick Start & Requirements

  • Installation: Clone the repository, create a conda environment (conda env create -f environment.yml), then install the package with poetry install; see the command sketch after this list.
  • Prerequisites: Python and PyTorch. Specific CUDA versions are not stated, but a CUDA-capable GPU is implied for acceleration.
  • Resources: Requires significant GPU memory for training/quantization of large models.
  • Links: Nunchaku Inference System, QServe GPU System, QoQ Algorithm Code.
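Putting the bullets together, a typical setup might look like the following. The clone URL and the conda environment name are assumptions inferred from the project and organization names, not verified against the README.

```bash
# Assumed repository URL based on the project/org names shown above
git clone https://github.com/nunchaku-tech/deepcompressor
cd deepcompressor

# Create the conda environment from the provided spec, then install dependencies
conda env create -f environment.yml
conda activate deepcompressor   # environment name is an assumption
poetry install
```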

Highlighted Details

  • SVDQuant achieves 3.5x memory reduction and 3.0x speedup for a 12B diffusion model, outperforming 4-bit baselines.
  • QServe improves LLM serving throughput by up to 3.5x compared to TensorRT-LLM on A100/L40S GPUs.
  • Supports 4-bit quantization for both weights and KV cache (W4A8KV4) in LLMs; see the sketch after this list.
  • The Nunchaku system seamlessly supports off-the-shelf LoRAs without re-quantization.
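For the KV4 part of W4A8KV4, here is a minimal sketch of per-head 4-bit KV cache quantization. QoQ's actual scheme (scale granularity, zero points, memory layout) differs, so treat this purely as an illustration of the idea.

```python
import torch

def quantize_kv_cache(kv: torch.Tensor, bits: int = 4):
    """Per-head symmetric quantization of a KV cache tensor (illustrative only)."""
    # kv: (batch, heads, seq_len, head_dim); one scale per head.
    qmax = 2 ** (bits - 1) - 1
    scale = kv.abs().amax(dim=(-2, -1), keepdim=True) / qmax
    q = torch.clamp(torch.round(kv / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale   # int8 used as a container for 4-bit values

kv = torch.randn(1, 8, 1024, 64)   # toy cache: 8 heads, 1024 tokens
q, s = quantize_kv_cache(kv)
print(q.dtype, s.shape)            # torch.int8, one scale per head
```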

Maintenance & Community

The project is associated with MIT HAN Lab, known for efficient generative AI research. Related projects have garnered significant attention (9k+ stars, 1M+ Hugging Face downloads).

Licensing & Compatibility

The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking would require clarification.

Limitations & Caveats

As noted above, the repository's license is unspecified, which could impact commercial adoption. While extensive benchmarks are provided, specific hardware requirements beyond GPU acceleration are not detailed.

Health Check

  • Last commit: 4 months ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 4
  • Star History: 120 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jaret Burkett (Founder of Ostris), and 1 more.

nunchaku by nunchaku-tech

2.1%
3k
High-performance 4-bit diffusion model inference engine
created 8 months ago
updated 15 hours ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jeremy Howard (Cofounder of fast.ai), and 4 more.

llm-awq by mit-han-lab

0.4%
3k
Weight quantization research paper for LLM compression/acceleration
created 2 years ago
updated 2 weeks ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Georgios Konstantopoulos (CTO, General Partner at Paradigm), and 2 more.

GPTQ-for-LLaMa by qwopqwop200

0.0%
3k
4-bit quantization for LLaMA models using GPTQ
created 2 years ago
updated 1 year ago