hqq by mobiusml

Model quantizer for fast, accurate post-training quantization without calibration data

Created 1 year ago
878 stars

Top 41.0% on SourcePulse

View on GitHub
Project Summary

HQQ (Half-Quadratic Quantization) is a fast, calibration-free quantization library for large machine learning models, supporting 1-8 bits. It enables efficient quantization of LLMs and vision models, significantly reducing VRAM usage and accelerating inference with minimal accuracy loss.

How It Works

HQQ employs a novel quantization approach that avoids the need for calibration data, drastically speeding up the quantization process. Weights are quantized in groups, with an axis parameter (0 or 1) selecting the dimension along which groups are formed. Dequantization is a linear operation, so it integrates cleanly with optimized CUDA/Triton kernels and with torch.compile for better performance.
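
As a rough illustration of the mechanism (a simplified min-max sketch in plain PyTorch, not the library's actual solver, which refines the scale/zero-point with its half-quadratic optimizer), group-wise quantization and its linear dequantization step look like this:

```python
import torch

def quantize_groups(W: torch.Tensor, nbits: int = 4, group_size: int = 64):
    # Reshape weights into (num_groups, group_size) and give each group its
    # own scale and zero-point (affine quantization per group).
    Wg = W.reshape(-1, group_size)
    w_min = Wg.min(dim=1, keepdim=True).values
    w_max = Wg.max(dim=1, keepdim=True).values
    scale = (w_max - w_min).clamp(min=1e-8) / (2**nbits - 1)
    zero = -w_min / scale
    W_q = torch.clamp(torch.round(Wg / scale + zero), 0, 2**nbits - 1)
    return W_q, scale, zero

def dequantize_groups(W_q, scale, zero, shape):
    # Dequantization is a plain linear map, (W_q - zero) * scale, which is
    # why it fuses well with CUDA/Triton kernels and torch.compile.
    return ((W_q - zero) * scale).reshape(shape)

W = torch.randn(128, 128)
W_q, scale, zero = quantize_groups(W, nbits=4, group_size=64)
W_approx = dequantize_groups(W_q, scale, zero, W.shape)
```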

Quick Start & Requirements

  • Install: pip install hqq or pip install git+https://github.com/mobiusml/hqq.git
  • Requirements: PyTorch 2.x with matching CUDA version.
  • Usage: Replace torch.nn.Linear with HQQLinear and configure it with BaseQuantizeConfig (see the sketch after this list).
  • Examples and detailed usage for Hugging Face Transformers, vLLM, and PEFT are available in the repository.
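
A minimal sketch of that single-layer workflow, assuming the BaseQuantizeConfig and HQQLinear interfaces described in the project README (argument names may vary between releases):

```python
import torch
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

# 4-bit weights, quantized in groups of 64 values.
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)

# Replace an existing float linear layer with its HQQ-quantized counterpart.
float_layer = torch.nn.Linear(4096, 4096, bias=False)
hqq_layer = HQQLinear(
    float_layer,                  # the torch.nn.Linear to quantize
    quant_config=quant_config,
    compute_dtype=torch.float16,  # dtype used for the dequantized compute
    device="cuda",
)

x = torch.randn(1, 4096, dtype=torch.float16, device="cuda")
y = hqq_layer(x)                  # forward pass dequantizes on the fly
```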

Highlighted Details

  • Supports 1, 2, 3, 4, and 8-bit quantization.
  • Offers multiple backends for dequantization and optimized inference (PyTorch, PyTorch Compile, ATen/CUDA, Torchao's tiny_gemm, Gemlite, Bitblas); see the backend-selection sketch after this list.
  • Compatible with Hugging Face Transformers, PEFT, and vLLM.
  • Achieves ~158 tokens/sec with Llama3-8B quantized to 4-bit on an RTX 4090.
  • HQQ+ introduces trainable low-rank adapters for improved low-bit quantization quality.
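
A hedged sketch of backend selection: HQQBackend/set_backend and the prepare_for_inference patching helper are taken from the project README, but which backends are available depends on the installed version and optional extras:

```python
from hqq.core.quantize import HQQLinear, HQQBackend

# The pure-PyTorch backend is the most portable; PYTORCH_COMPILE and the
# ATen/CUDA backend are faster but carry the axis restrictions noted under
# Limitations & Caveats below.
HQQLinear.set_backend(HQQBackend.PYTORCH_COMPILE)

# For the fused inference kernels (torchao tiny_gemm, gemlite, bitblas), the
# repository ships a patching utility; the backend string below is only an
# example -- check the README for the values your install supports.
# from hqq.utils.patching import prepare_for_inference
# prepare_for_inference(model, backend="torchao_int4")
```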

Maintenance & Community

  • Developed by MobiusML.
  • Active development with regular updates.
  • Examples and usage guides are provided.

Licensing & Compatibility

  • License: Apache 2.0.
  • Compatible with commercial use and closed-source applications.

Limitations & Caveats

  • Optimized inference backends (Torchao, Gemlite, Bitblas) primarily support axis=1; see the configuration sketch after this list.
  • The ATen backend only supports axis=0.
  • Specific group-size values may have restrictions depending on the chosen inference backend.
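
For instance, matching the grouping axis to the intended backend might look like the sketch below (axis is a BaseQuantizeConfig parameter; supported group-size values vary per backend, so treat these values as assumptions to verify against the README):

```python
from hqq.core.quantize import BaseQuantizeConfig

# axis=1 grouping: what the Torchao/Gemlite/Bitblas inference backends expect.
config_axis1 = BaseQuantizeConfig(nbits=4, group_size=64, axis=1)

# axis=0 grouping: required when using the ATen backend.
config_axis0 = BaseQuantizeConfig(nbits=4, group_size=64, axis=0)
```
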
Health Check

  • Last Commit: 1 week ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 16 stars in the last 30 days

Explore Similar Projects

Starred by Jeremy Howard (Cofounder of fast.ai), Sasha Rush (Research Scientist at Cursor; Professor at Cornell Tech), and 1 more.

GPTQ-triton by fpgaminer

0%
307 stars
Triton kernel for GPTQ inference, improving context scaling
Created 2 years ago
Updated 2 years ago
Starred by Yaowei Zheng (Author of LLaMA-Factory), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 6 more.

gptq by IST-DASLab

0.1%
2k stars
Code for GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers
Created 2 years ago
Updated 1 year ago
Starred by Lysandre Debut (Chief Open-Source Officer at Hugging Face), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 4 more.

AQLM by Vahe1994

0.4%
1k stars
PyTorch code for LLM compression via Additive Quantization (AQLM)
Created 1 year ago
Updated 1 month ago
Starred by Junyang Lin (Core Maintainer at Alibaba Qwen), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 3 more.

neural-compressor by intel

0.2%
2k stars
Python library for model compression (quantization, pruning, distillation, NAS)
Created 5 years ago
Updated 14 hours ago