Model quantizer for fast, accurate post-training quantization, skipping calibration
HQQ (Half-Quadratic Quantization) is a fast, calibration-free quantization library for large machine learning models, supporting 1-8 bits. It enables efficient quantization of LLMs and vision models, significantly reducing VRAM usage and accelerating inference with minimal accuracy loss.
How It Works
HQQ employs a novel quantization approach that avoids the need for calibration data, drastically speeding up the quantization process. It quantizes weights in groups, with an axis parameter (0 or 1) selecting the dimension along which groups are formed. The dequantization step is a linear operation, allowing seamless integration with optimized CUDA/Triton kernels and torch.compile for enhanced performance.
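The linearity mentioned above is easy to see in a toy sketch of group-wise affine quantization: dequantization is just a per-element multiply and shift, W ≈ (W_q - z) * s. This is an illustrative, pure-Python sketch under assumed conventions, not HQQ's actual implementation; all function names here are hypothetical.

```python
def quantize_group(weights, nbits=4):
    """Quantize one group of weights to nbits levels via a scale and zero-point.

    Hypothetical helper for illustration only (not part of the hqq API).
    """
    qmax = 2 ** nbits - 1
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / qmax or 1.0  # avoid zero scale for constant groups
    zero = -w_min / scale
    # Round each weight to the nearest integer level and clamp to [0, qmax].
    q = [max(0, min(qmax, round(w / scale + zero))) for w in weights]
    return q, scale, zero


def dequantize_group(q, scale, zero):
    """Linear dequantization: one multiply and one shift per element."""
    return [(qi - zero) * scale for qi in q]


# One group of weights; real models use many groups (e.g. group_size=64).
group = [0.12, -0.53, 0.88, 0.05, -0.97, 0.41]
q, scale, zero = quantize_group(group, nbits=4)
recon = dequantize_group(q, scale, zero)
max_err = max(abs(a - b) for a, b in zip(group, recon))
```

Because dequantization is linear, it fuses cleanly into a matmul kernel, which is why approaches like this pair well with CUDA/Triton kernels and torch.compile.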
Quick Start & Requirements
pip install hqq
or, from source: pip install git+https://github.com/mobiusml/hqq.git
To quantize a model, replace its torch.nn.Linear layers with HQQLinear and configure them with BaseQuantizeConfig.
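A minimal sketch of that flow is shown below. It requires the hqq package and a CUDA device; exact constructor signatures may vary between hqq versions, so treat the keyword arguments here as assumptions and consult the repository for the current API.

```python
import torch
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

# Configuration: bit-width and group size (axis selects the grouping dimension).
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)

# Replace an existing torch.nn.Linear with its quantized counterpart.
linear = torch.nn.Linear(4096, 4096)
hqq_layer = HQQLinear(
    linear,                         # the float layer to quantize
    quant_config,                   # the config built above
    compute_dtype=torch.float16,    # dtype used for dequantized compute
    device="cuda",                  # assumed: a CUDA device is available
)
```

After the swap, the quantized layer is used as a drop-in replacement in the model's forward pass.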
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
Backend support depends on the grouping axis: some optimized kernels only work with axis=1, while others require axis=0.