hqq by mobiusml

Model quantizer for fast, accurate post-training quantization without calibration data

Created 1 year ago
878 stars

Top 41.0% on SourcePulse

View on GitHub
Project Summary

HQQ (Half-Quadratic Quantization) is a fast, calibration-free quantization library for large machine learning models, supporting 1-8 bits. It enables efficient quantization of LLMs and vision models, significantly reducing VRAM usage and accelerating inference with minimal accuracy loss.

How It Works

HQQ employs a novel quantization approach that avoids the need for calibration data, drastically speeding up the quantization process. Weights are quantized in groups, with an axis parameter (0 or 1) selecting the dimension along which groups are formed. Dequantization is a linear operation, so it integrates cleanly with optimized CUDA/Triton kernels and with torch.compile for better performance.
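
As a rough illustration of the mechanism (a simplified min-max sketch in plain PyTorch, not the library's actual solver, which refines the scale/zero-point with its half-quadratic optimizer), group-wise quantization and its linear dequantization step look like this:

```python
import torch

def quantize_groups(W: torch.Tensor, nbits: int = 4, group_size: int = 64):
    # Reshape weights into (num_groups, group_size) and give each group its
    # own scale and zero-point (affine quantization per group).
    Wg = W.reshape(-1, group_size)
    w_min = Wg.min(dim=1, keepdim=True).values
    w_max = Wg.max(dim=1, keepdim=True).values
    scale = (w_max - w_min).clamp(min=1e-8) / (2**nbits - 1)
    zero = -w_min / scale
    W_q = torch.clamp(torch.round(Wg / scale + zero), 0, 2**nbits - 1)
    return W_q, scale, zero

def dequantize_groups(W_q, scale, zero, shape):
    # Dequantization is a plain linear map, (W_q - zero) * scale, which is
    # why it fuses well with CUDA/Triton kernels and torch.compile.
    return ((W_q - zero) * scale).reshape(shape)

W = torch.randn(128, 128)
W_q, scale, zero = quantize_groups(W, nbits=4, group_size=64)
W_approx = dequantize_groups(W_q, scale, zero, W.shape)
```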

Quick Start & Requirements

  • Install: pip install hqq or pip install git+https://github.com/mobiusml/hqq.git
  • Requirements: PyTorch 2.x with matching CUDA version.
  • Usage: Replace torch.nn.Linear with HQQLinear and configure it with BaseQuantizeConfig (see the sketch after this list).
  • Examples and detailed usage for Hugging Face Transformers, vLLM, and PEFT are available in the repository.
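
A minimal sketch of that single-layer workflow, assuming the BaseQuantizeConfig and HQQLinear interfaces described in the project README (argument names may vary between releases):

```python
import torch
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

# 4-bit weights, quantized in groups of 64 values.
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)

# Replace an existing float linear layer with its HQQ-quantized counterpart.
float_layer = torch.nn.Linear(4096, 4096, bias=False)
hqq_layer = HQQLinear(
    float_layer,                  # the torch.nn.Linear to quantize
    quant_config=quant_config,
    compute_dtype=torch.float16,  # dtype used for the dequantized compute
    device="cuda",
)

x = torch.randn(1, 4096, dtype=torch.float16, device="cuda")
y = hqq_layer(x)                  # forward pass dequantizes on the fly
```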

Highlighted Details

  • Supports 1, 2, 3, 4, and 8-bit quantization.
  • Offers multiple backends for dequantization and optimized inference (PyTorch, PyTorch Compile, ATen/CUDA, Torchao's tiny_gemm, Gemlite, Bitblas); see the backend-selection sketch after this list.
  • Compatible with Hugging Face Transformers, PEFT, and vLLM.
  • Achieves ~158 tokens/sec with Llama3-8B quantized to 4-bit on an RTX 4090.
  • HQQ+ introduces trainable low-rank adapters for improved low-bit quantization quality.
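
A hedged sketch of backend selection: HQQBackend/set_backend and the prepare_for_inference patching helper are taken from the project README, but which backends are available depends on the installed version and optional extras:

```python
from hqq.core.quantize import HQQLinear, HQQBackend

# The pure-PyTorch backend is the most portable; PYTORCH_COMPILE and the
# ATen/CUDA backend are faster but carry the axis restrictions noted under
# Limitations & Caveats below.
HQQLinear.set_backend(HQQBackend.PYTORCH_COMPILE)

# For the fused inference kernels (torchao tiny_gemm, gemlite, bitblas), the
# repository ships a patching utility; the backend string below is only an
# example -- check the README for the values your install supports.
# from hqq.utils.patching import prepare_for_inference
# prepare_for_inference(model, backend="torchao_int4")
```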

Maintenance & Community

  • Developed by MobiusML.
  • Active development with regular updates.
  • Examples and usage guides are provided.

Licensing & Compatibility

  • License: Apache 2.0.
  • Compatible with commercial use and closed-source applications.

Limitations & Caveats

  • Optimized inference backends (Torchao, Gemlite, Bitblas) primarily support axis=1; see the configuration sketch after this list.
  • The ATen backend only supports axis=0.
  • Specific group-size values may have restrictions depending on the chosen inference backend.
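
For instance, matching the grouping axis to the intended backend might look like the sketch below (axis is a BaseQuantizeConfig parameter; supported group-size values vary per backend, so treat these values as assumptions to verify against the README):

```python
from hqq.core.quantize import BaseQuantizeConfig

# axis=1 grouping: what the Torchao/Gemlite/Bitblas inference backends expect.
config_axis1 = BaseQuantizeConfig(nbits=4, group_size=64, axis=1)

# axis=0 grouping: required when using the ATen backend.
config_axis0 = BaseQuantizeConfig(nbits=4, group_size=64, axis=0)
```
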
Health Check

  • Last Commit: 1 week ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 16 stars in the last 30 days

Explore Similar Projects

Starred by Jeremy Howard (Cofounder of fast.ai), Sasha Rush (Research Scientist at Cursor; Professor at Cornell Tech), and 1 more.

GPTQ-triton by fpgaminer

0%
307 stars
Triton kernel for GPTQ inference, improving context scaling
Created 2 years ago
Updated 2 years ago
Starred by Yaowei Zheng (Author of LLaMA-Factory), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 6 more.

gptq by IST-DASLab

0.1%
2k stars
Code for GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers
Created 2 years ago
Updated 1 year ago
Starred by Lysandre Debut (Chief Open-Source Officer at Hugging Face), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 4 more.

AQLM by Vahe1994

0.4%
1k stars
PyTorch code for LLM compression via Additive Quantization (AQLM)
Created 1 year ago
Updated 1 month ago
Starred by Junyang Lin (Core Maintainer at Alibaba Qwen), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 3 more.

neural-compressor by intel

0.2%
2k stars
Python library for model compression (quantization, pruning, distillation, NAS)
Created 5 years ago
Updated 14 hours ago