deepcompressor by nunchaku-tech

Model compression toolbox for LLMs and diffusion models

created 1 year ago
565 stars

Top 57.8% on sourcepulse

View on GitHub
1 Expert Loves This Project
Project Summary

DeepCompressor is a PyTorch-based toolbox for compressing Large Language Models (LLMs) and Diffusion Models, targeting researchers and engineers aiming to deploy these models efficiently. It offers advanced quantization techniques, including 4-bit and 8-bit precision for weights and activations, significantly reducing memory footprint and latency while preserving model accuracy.
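To make the precision labels concrete, here is a minimal PyTorch sketch of symmetric uniform quantization, the basic operation behind notations like W4 (4-bit weights) and A8 (8-bit activations). It is illustrative only, not DeepCompressor's API; production schemes typically use per-group or per-channel scales rather than the per-tensor scale shown here.

```python
import torch

def quantize_symmetric(w: torch.Tensor, bits: int = 4):
    """Symmetric uniform quantization to signed `bits`-bit integers (illustrative)."""
    qmax = 2 ** (bits - 1) - 1               # 7 for 4-bit, 127 for 8-bit
    scale = w.abs().max() / qmax             # per-tensor scale; real schemes use finer granularity
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale           # int8 used as a container for 4-bit values

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(128, 128)
q, s = quantize_symmetric(w, bits=4)         # "W4": weights stored in 4 bits
print((w - dequantize(q, s)).abs().mean())   # mean quantization error
```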

How It Works

The toolbox implements state-of-the-art quantization algorithms such as AWQ, GPTQ, and SmoothQuant, alongside two novel contributions: QoQ (W4A8KV4 for LLMs) and SVDQuant (W4A4 for diffusion models). QoQ reduces the overheads of low-bit LLM serving by optimizing dequantization and KV cache handling. SVDQuant makes aggressive 4-bit quantization viable for diffusion models by absorbing outliers into a low-rank component, while a fused inference engine (Nunchaku) keeps the extra low-rank branch efficient at runtime.
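The low-rank idea is easiest to see in code. Below is a conceptual PyTorch sketch of an SVDQuant-style decomposition under simplifying assumptions (per-tensor scale, no activation-side smoothing): a rank-r branch absorbs the dominant, outlier-heavy components so the residual can survive 4-bit quantization. The function name and the SVD-based split are illustrative, not the repository's implementation.

```python
import torch

def svdquant_sketch(w: torch.Tensor, rank: int = 32, bits: int = 4) -> torch.Tensor:
    """Low-rank branch + 4-bit residual: the core idea behind SVDQuant (illustrative)."""
    # A rank-`rank` branch (kept in 16-bit at inference) absorbs the
    # dominant directions of the weight matrix.
    u, s, vh = torch.linalg.svd(w, full_matrices=False)
    l1 = u[:, :rank] * s[:rank]     # (out_features, rank)
    l2 = vh[:rank, :]               # (rank, in_features)

    # The flattened residual is far easier to quantize to 4 bits.
    residual = w - l1 @ l2
    qmax = 2 ** (bits - 1) - 1
    scale = residual.abs().max() / qmax   # per-tensor scale for simplicity
    q = torch.clamp(torch.round(residual / scale), -qmax - 1, qmax)

    # Effective weight seen at inference: low-rank branch + dequantized residual.
    return l1 @ l2 + q * scale

w = torch.randn(512, 512)
print((w - svdquant_sketch(w)).abs().mean())   # reconstruction error
```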

Quick Start & Requirements

  • Installation: Clone the repository, create a conda environment (conda env create -f environment.yml), then install the package with poetry install; see the command sketch after this list.
  • Prerequisites: Python and PyTorch. Specific CUDA versions are not stated, but a CUDA-capable GPU is implied for acceleration.
  • Resources: Requires significant GPU memory for training/quantization of large models.
  • Links: Nunchaku Inference System, QServe GPU System, QoQ Algorithm Code.
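Putting the bullets together, a typical setup might look like the following. The clone URL and the conda environment name are assumptions inferred from the project and organization names, not verified against the README.

```bash
# Assumed repository URL based on the project/org names shown above
git clone https://github.com/nunchaku-tech/deepcompressor
cd deepcompressor

# Create the conda environment from the provided spec, then install dependencies
conda env create -f environment.yml
conda activate deepcompressor   # environment name is an assumption
poetry install
```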

Highlighted Details

  • SVDQuant achieves 3.5x memory reduction and 3.0x speedup for a 12B diffusion model, outperforming 4-bit baselines.
  • QServe improves LLM serving throughput by up to 3.5x compared to TensorRT-LLM on A100/L40S GPUs.
  • Supports 4-bit quantization for both weights and KV cache (W4A8KV4) in LLMs; see the sketch after this list.
  • The Nunchaku system seamlessly supports off-the-shelf LoRAs without re-quantization.
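For the KV4 part of W4A8KV4, here is a minimal sketch of per-head 4-bit KV cache quantization. QoQ's actual scheme (scale granularity, zero points, memory layout) differs, so treat this purely as an illustration of the idea.

```python
import torch

def quantize_kv_cache(kv: torch.Tensor, bits: int = 4):
    """Per-head symmetric quantization of a KV cache tensor (illustrative only)."""
    # kv: (batch, heads, seq_len, head_dim); one scale per head.
    qmax = 2 ** (bits - 1) - 1
    scale = kv.abs().amax(dim=(-2, -1), keepdim=True) / qmax
    q = torch.clamp(torch.round(kv / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale   # int8 used as a container for 4-bit values

kv = torch.randn(1, 8, 1024, 64)   # toy cache: 8 heads, 1024 tokens
q, s = quantize_kv_cache(kv)
print(q.dtype, s.shape)            # torch.int8, one scale per head
```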

Maintenance & Community

The project is associated with MIT HAN Lab, known for efficient generative AI research. Related projects have garnered significant attention (9k+ stars, 1M+ Hugging Face downloads).

Licensing & Compatibility

The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking would require clarification.

Limitations & Caveats

As noted above, the repository's license is unspecified, which could impact commercial adoption. While extensive benchmarks are provided, specific hardware requirements beyond GPU acceleration are not detailed.

Health Check

  • Last commit: 4 months ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 4
  • Star History: 120 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jaret Burkett (Founder of Ostris), and 1 more.

nunchaku by nunchaku-tech

2.1%
3k
High-performance 4-bit diffusion model inference engine
created 8 months ago
updated 15 hours ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jeremy Howard (Cofounder of fast.ai), and 4 more.

llm-awq by mit-han-lab

0.4%
3k
Weight quantization research paper for LLM compression/acceleration
created 2 years ago
updated 2 weeks ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Georgios Konstantopoulos (CTO, General Partner at Paradigm), and 2 more.

GPTQ-for-LLaMa by qwopqwop200

0.0%
3k
4-bit quantization for LLaMA models using GPTQ
created 2 years ago
updated 1 year ago