VPTQ by Microsoft

LLM quantization research project targeting extreme low-bit compression

created 11 months ago
649 stars

Top 52.4% on sourcepulse

Project Summary

VPTQ offers an extreme low-bit (sub-2-bit) post-training quantization method for Large Language Models (LLMs), enabling significant model compression without retraining. It targets researchers and engineers seeking to deploy large models on resource-constrained hardware by reducing memory footprint and bandwidth requirements.

How It Works

VPTQ uses Vector Quantization (VQ) to compress LLM weights: groups of weights are treated as vectors and replaced by short indices into learned lookup tables (codebooks). Because a single index encodes an entire vector, the effective bits per weight can fall below the point where traditional scalar quantization breaks down, preserving accuracy even at sub-2-bit rates.
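As a concrete illustration of the index-plus-codebook scheme, here is a toy sketch of plain vector quantization with a k-means codebook. It shows only the storage idea, not VPTQ's actual algorithm; the function names are made up for the example.

```python
import torch

def vq_quantize(weight: torch.Tensor, vec_dim: int = 8,
                codebook_size: int = 256, iters: int = 10):
    """Toy vector quantization: store uint8 indices + a small float codebook.

    Effective rate is log2(codebook_size) / vec_dim bits per weight
    (here 8 / 8 = 1 bit), versus 16 bits for fp16 weights.
    """
    vecs = weight.reshape(-1, vec_dim)                   # (N, vec_dim)
    # Initialize the codebook from randomly chosen weight vectors.
    codebook = vecs[torch.randperm(vecs.shape[0])[:codebook_size]].clone()
    for _ in range(iters):                               # plain k-means
        idx = torch.cdist(vecs, codebook).argmin(dim=1)  # nearest centroid
        for k in range(codebook_size):
            members = vecs[idx == k]
            if members.shape[0] > 0:
                codebook[k] = members.mean(dim=0)
    idx = torch.cdist(vecs, codebook).argmin(dim=1)
    return idx.to(torch.uint8), codebook

def vq_dequantize(idx: torch.Tensor, codebook: torch.Tensor, shape):
    return codebook[idx.long()].reshape(shape)           # pure table lookup

w = torch.randn(512, 512)
idx, cb = vq_quantize(w)
w_hat = vq_dequantize(idx, cb, w.shape)
print(f"reconstruction MSE: {((w - w_hat) ** 2).mean():.4f}")
```

With vec_dim=8 and a 256-entry codebook, each group of eight fp16 weights collapses to a single byte, i.e. 1 bit per weight before codebook overhead.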

Quick Start & Requirements

  • Install: pip install vptq (see the loading sketch after this list)
  • Prerequisites: Python 3.10+, CUDA toolkit, torch >= 2.3.0, transformers >= 4.44.0, accelerate >= 0.33.0, flash_attn >= 2.5.0, cmake >= 3.18.0.
  • Resources: Quantizing a 405B model takes approximately 17 hours.
  • Docs: Get Started, Technical Report, Hugging Face Demo, Colab
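A minimal loading sketch following the project's Get Started example: pre-quantized checkpoints load through vptq.AutoModelForCausalLM. The checkpoint name below is one illustrative community model; substitute any published VPTQ checkpoint.

```python
import transformers
import vptq

# Illustrative pre-quantized checkpoint from the VPTQ-community collection.
model_name = "VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-0-woft"

tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
model = vptq.AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

inputs = tokenizer("Explain vector quantization in one sentence.",
                   return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```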

Highlighted Details

  • Achieves high accuracy on models down to 1-2 bits (e.g., 405B @ <2bit, 70B @ 2bit).
  • Quantization process is lightweight, with a 405B model taking ~17 hours.
  • Offers agile inference with low decode overhead and high throughput.
  • Integrated into Hugging Face Transformers (v4.48.0+) and supported by inference engines such as aphrodite-engine; see the Transformers loading sketch below.
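For the Transformers path, a pre-quantized VPTQ checkpoint should load through the standard API once the vptq package is installed; a sketch, again with an illustrative checkpoint name:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Requires transformers >= 4.48.0 and `pip install vptq`; the VPTQ
# quantization config stored in the checkpoint is detected automatically.
model_id = "VPTQ-community/Meta-Llama-3.1-8B-Instruct-v8-k65536-256-woft"

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto",
                                             device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)
```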

Maintenance & Community

  • Project is actively developed by Microsoft researchers.
  • Community contributions are encouraged via GitHub issues and pull requests.
  • Roadmap includes improved inference kernels (CUDA, ROCm, Triton), integration with vLLM and llama.cpp, and edge deployment.

Licensing & Compatibility

  • License: MIT
  • Compatibility: Permissive for commercial use and integration with closed-source applications.

Limitations & Caveats

VPTQ is intended for research and experimental purposes, requiring further validation for production use. The repository provides the quantization algorithm; performance of community-provided quantized models is not guaranteed. Current testing is limited to English text.

Health Check

Last commit: 3 months ago
Responsiveness: Inactive
Pull Requests (30d): 1
Issues (30d): 2

Star History

19 stars in the last 90 days

Explore Similar Projects

Starred by Lysandre Debut (Chief Open-Source Officer at Hugging Face), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 1 more.

AQLM by Vahe1994

PyTorch code for LLM compression via Additive Quantization (AQLM)

0.1% · 1k stars
created 1 year ago, updated 2 months ago

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Georgios Konstantopoulos (CTO, General Partner at Paradigm), and 2 more.

GPTQ-for-LLaMa by qwopqwop200

4-bit quantization for LLaMA models using GPTQ

0.0% · 3k stars
created 2 years ago, updated 1 year ago