LLM quantization research paper targeting extreme low-bit compression
VPTQ offers an extreme low-bit (sub-2-bit) post-training quantization method for Large Language Models (LLMs), enabling significant model compression without retraining. It targets researchers and engineers seeking to deploy large models on resource-constrained hardware by reducing memory footprint and bandwidth requirements.
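For a rough sense of the memory savings (back-of-envelope arithmetic, not a benchmark reported by the project; codebook and activation overheads are ignored):

# Approximate weight-storage cost of a 70B-parameter model at different bit-widths.
params = 70e9
for bits in (16, 4, 2):
    print(f"{bits:>2}-bit weights: {params * bits / 8 / 1e9:.1f} GB")
# 16-bit: 140.0 GB, 4-bit: 35.0 GB, 2-bit: 17.5 GB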
How It Works
VPTQ leverages Vector Quantization (VQ) to compress LLM weights: groups of weights are treated as short vectors, and each vector is replaced by an index into a learned lookup table (codebook), so dequantization is a simple table lookup. Because an index addresses an entire weight vector rather than a single scalar, VQ retains accuracy at the extremely low bit-widths where traditional scalar quantization breaks down, enabling significant compression while preserving model performance.
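The following toy NumPy sketch illustrates the lookup-table idea (not the project's implementation; the vector length, codebook size, and random centroid selection are illustrative assumptions, whereas VPTQ learns its codebooks from the model's weights rather than sampling them at random):

import numpy as np

# Toy vector quantization of a weight matrix: group weights into length-v vectors,
# map each vector to the index of its nearest codebook centroid, and reconstruct
# by lookup. With v = 8 and k = 256 centroids, each index costs 8 bits, i.e.
# about 1 bit per weight (plus a small, shared codebook).
v, k = 8, 256                                      # assumed vector length / codebook size
W = np.random.randn(256, 256).astype(np.float32)   # stand-in for a weight matrix

vectors = W.reshape(-1, v)
# Illustrative codebook: random existing vectors. A real quantizer would learn
# centroids (e.g. via k-means) to minimize reconstruction error.
codebook = vectors[np.random.choice(len(vectors), k, replace=False)]

# Assign each vector to its nearest centroid by squared Euclidean distance.
dists = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
indices = dists.argmin(axis=1).astype(np.uint8)    # the stored "weights" are these indices

W_hat = codebook[indices].reshape(W.shape)         # dequantization is a table lookup
print("mean squared error:", float(((W - W_hat) ** 2).mean()))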
Quick Start & Requirements
pip install vptq
Requirements: torch >= 2.3.0, transformers >= 4.44.0, accelerate >= 0.33.0, flash_attn >= 2.5.0, cmake >= 3.18.0.
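A minimal inference sketch for a pre-quantized checkpoint, assuming the package exposes a transformers-style vptq.AutoModelForCausalLM wrapper and that the model ID shown is one of the community-provided checkpoints on the Hugging Face Hub (consult the repository's README for the exact API and available models):

import transformers
import vptq

model_id = "VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-0-woft"  # assumed example ID
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
model = vptq.AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Explain vector quantization in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))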
Highlighted Details
Maintenance & Community
The most recent activity was about 3 months ago, and the repository is currently marked as inactive.
Licensing & Compatibility
Limitations & Caveats
VPTQ is intended for research and experimental purposes, requiring further validation for production use. The repository provides the quantization algorithm; performance of community-provided quantized models is not guaranteed. Current testing is limited to English text.