deepcompressor by nunchaku-tech

Model compression toolbox for LLMs and diffusion models

Created 1 year ago
632 stars

Top 52.4% on SourcePulse

Project Summary

DeepCompressor is a PyTorch-based toolbox for compressing large language models (LLMs) and diffusion models, aimed at researchers and engineers who need to deploy these models efficiently. It offers advanced quantization techniques, including 4-bit and 8-bit precision for weights and activations, significantly reducing memory footprint and latency while preserving model accuracy.
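To make the memory and accuracy trade-off concrete, here is a minimal, generic sketch of symmetric per-channel 4-bit weight quantization in PyTorch. It illustrates the technique in general, not DeepCompressor's API; quantize_w4 and dequantize are hypothetical names.

    import torch

    def quantize_w4(weight: torch.Tensor):
        # Symmetric per-output-channel quantization to the signed 4-bit range [-8, 7].
        qmax = 7
        scale = weight.abs().amax(dim=1, keepdim=True) / qmax
        q = torch.clamp(torch.round(weight / scale), -8, qmax).to(torch.int8)
        return q, scale  # real kernels pack two 4-bit codes per byte

    def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
        return q.float() * scale

    w = torch.randn(4096, 4096)  # ~64 MiB in FP32; ~8 MiB once packed to 4 bits
    q, s = quantize_w4(w)
    err = (w - dequantize(q, s)).abs().max().item()
    print(f"max abs quantization error: {err:.4f}")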

How It Works

The toolbox implements state-of-the-art quantization algorithms such as AWQ, GPTQ, and SmoothQuant, alongside its own contributions: QoQ (W4A8KV4 for LLMs) and SVDQuant (W4A4 for diffusion models). QoQ reduces the dequantization and KV-cache overheads of low-bit LLM serving, while SVDQuant enables aggressive 4-bit quantization of diffusion models by absorbing weight outliers into a low-rank branch, with the fused Nunchaku inference engine executing the result efficiently.
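The low-rank idea can be sketched in a few lines: split each weight into a small high-precision branch that absorbs the largest singular directions (where outliers concentrate) and a residual that is handed to the 4-bit quantizer. This is an assumption-laden illustration, not the project's implementation; svd_split and rank=32 are arbitrary choices.

    import torch

    def svd_split(weight: torch.Tensor, rank: int = 32):
        # The low-rank branch is kept in high precision; the residual is
        # what would actually be quantized to 4 bits.
        U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
        low_rank = (U[:, :rank] * S[:rank]) @ Vh[:rank]
        residual = weight - low_rank
        return low_rank, residual

    w = torch.randn(1024, 1024)
    low_rank, residual = svd_split(w)
    # The residual's spectral norm (top singular value) is smaller than the
    # original's, so 4-bit quantization of it loses less accuracy.
    print(torch.linalg.matrix_norm(w, ord=2).item(),
          torch.linalg.matrix_norm(residual, ord=2).item())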

Quick Start & Requirements

  • Installation: Clone the repository, create a conda environment (conda env create -f environment.yml), then install dependencies with poetry install.
  • Prerequisites: Python and PyTorch. Specific CUDA versions are not stated, but a CUDA-capable GPU is implied for acceleration.
  • Resources: Requires significant GPU memory for training/quantization of large models.
  • Links: Nunchaku Inference System, QServe GPU System, QoQ Algorithm Code.

Highlighted Details

  • SVDQuant achieves 3.5x memory reduction and 3.0x speedup for a 12B diffusion model, outperforming 4-bit baselines.
  • QServe improves LLM serving throughput by up to 3.5x compared to TensorRT-LLM on A100/L40S GPUs.
  • Supports 4-bit quantization for both weights and the KV cache (W4A8KV4) in LLMs (see the sketch after this list).
  • The Nunchaku inference engine supports off-the-shelf LoRAs without re-quantization.
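Below is the KV-cache sketch referenced in the list above: a hedged illustration of per-token 4-bit quantization of keys and values along the head dimension, not QoQ's actual kernels; quantize_kv4 is a hypothetical name.

    import torch

    def quantize_kv4(kv: torch.Tensor):
        # kv: (batch, heads, seq_len, head_dim). One scale per token per head;
        # once packed, the cache is roughly 4x smaller than FP16.
        qmax = 7
        scale = kv.abs().amax(dim=-1, keepdim=True) / qmax
        q = torch.clamp(torch.round(kv / scale), -8, qmax).to(torch.int8)
        return q, scale

    k = torch.randn(1, 32, 128, 128)
    qk, sk = quantize_kv4(k)
    k_hat = qk.float() * sk  # dequantized on the fly inside attention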

Maintenance & Community

The project is associated with MIT HAN Lab, known for efficient generative AI research. Related projects have garnered significant attention (9k+ stars, 1M+ Hugging Face downloads).

Licensing & Compatibility

The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking would require clarification.

Limitations & Caveats

As noted above, the absence of an explicit license could hinder commercial adoption. Extensive benchmarks are provided, but hardware requirements beyond general GPU acceleration are not detailed.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: Inactive
  • Pull requests (30d): 1
  • Issues (30d): 5
  • Star history: 36 stars in the last 30 days

Explore Similar Projects

Starred by Yaowei Zheng (author of LLaMA-Factory), Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), and 7 more.

llm-awq by mit-han-lab

Top 0.3% · 3k stars · Created 2 years ago · Updated 2 months ago
Weight quantization research paper for LLM compression/acceleration
Starred by Junyang Lin (core maintainer at Alibaba Qwen), Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), and 3 more.

neural-compressor by intel

Top 0.2% · 2k stars · Created 5 years ago · Updated 15 hours ago
Python library for model compression (quantization, pruning, distillation, NAS)