VPTQ by Microsoft

LLM quantization research project targeting extreme low-bit compression

created 11 months ago
649 stars

Top 52.4% on sourcepulse

Project Summary

VPTQ offers an extreme low-bit (sub-2-bit) post-training quantization method for Large Language Models (LLMs), enabling significant model compression without retraining. It targets researchers and engineers seeking to deploy large models on resource-constrained hardware by reducing memory footprint and bandwidth requirements.

How It Works

VPTQ uses Vector Quantization (VQ) to compress LLM weights: groups of weights are treated as vectors and replaced by short indices into learned lookup tables (codebooks). Because a single index encodes an entire vector, the effective bits per weight can fall below the point where traditional scalar quantization breaks down, preserving accuracy even at sub-2-bit rates.
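As a concrete illustration of the index-plus-codebook scheme, here is a toy sketch of plain vector quantization with a k-means codebook. It shows only the storage idea, not VPTQ's actual algorithm; the function names are made up for the example.

```python
import torch

def vq_quantize(weight: torch.Tensor, vec_dim: int = 8,
                codebook_size: int = 256, iters: int = 10):
    """Toy vector quantization: store uint8 indices + a small float codebook.

    Effective rate is log2(codebook_size) / vec_dim bits per weight
    (here 8 / 8 = 1 bit), versus 16 bits for fp16 weights.
    """
    vecs = weight.reshape(-1, vec_dim)                   # (N, vec_dim)
    # Initialize the codebook from randomly chosen weight vectors.
    codebook = vecs[torch.randperm(vecs.shape[0])[:codebook_size]].clone()
    for _ in range(iters):                               # plain k-means
        idx = torch.cdist(vecs, codebook).argmin(dim=1)  # nearest centroid
        for k in range(codebook_size):
            members = vecs[idx == k]
            if members.shape[0] > 0:
                codebook[k] = members.mean(dim=0)
    idx = torch.cdist(vecs, codebook).argmin(dim=1)
    return idx.to(torch.uint8), codebook

def vq_dequantize(idx: torch.Tensor, codebook: torch.Tensor, shape):
    return codebook[idx.long()].reshape(shape)           # pure table lookup

w = torch.randn(512, 512)
idx, cb = vq_quantize(w)
w_hat = vq_dequantize(idx, cb, w.shape)
print(f"reconstruction MSE: {((w - w_hat) ** 2).mean():.4f}")
```

With vec_dim=8 and a 256-entry codebook, each group of eight fp16 weights collapses to a single byte, i.e. 1 bit per weight before codebook overhead.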

Quick Start & Requirements

  • Install: pip install vptq (see the loading sketch after this list)
  • Prerequisites: Python 3.10+, CUDA toolkit, torch >= 2.3.0, transformers >= 4.44.0, accelerate >= 0.33.0, flash_attn >= 2.5.0, cmake >= 3.18.0.
  • Resources: Quantizing a 405B model takes approximately 17 hours.
  • Docs: Get Started, Technical Report, Hugging Face Demo, Colab
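A minimal loading sketch following the project's Get Started example: pre-quantized checkpoints load through vptq.AutoModelForCausalLM. The checkpoint name below is one illustrative community model; substitute any published VPTQ checkpoint.

```python
import transformers
import vptq

# Illustrative pre-quantized checkpoint from the VPTQ-community collection.
model_name = "VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-0-woft"

tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
model = vptq.AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

inputs = tokenizer("Explain vector quantization in one sentence.",
                   return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```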

Highlighted Details

  • Achieves high accuracy on models down to 1-2 bits (e.g., 405B @ <2bit, 70B @ 2bit).
  • Quantization process is lightweight, with a 405B model taking ~17 hours.
  • Offers agile inference with low decode overhead and high throughput.
  • Integrated into Hugging Face Transformers (v4.48.0+) and supported by inference engines such as aphrodite-engine; see the Transformers loading sketch below.
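For the Transformers path, a pre-quantized VPTQ checkpoint should load through the standard API once the vptq package is installed; a sketch, again with an illustrative checkpoint name:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Requires transformers >= 4.48.0 and `pip install vptq`; the VPTQ
# quantization config stored in the checkpoint is detected automatically.
model_id = "VPTQ-community/Meta-Llama-3.1-8B-Instruct-v8-k65536-256-woft"

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto",
                                             device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)
```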

Maintenance & Community

  • Project is actively developed by Microsoft researchers.
  • Community contributions are encouraged via GitHub issues and pull requests.
  • Roadmap includes improved inference kernels (CUDA, ROCm, Triton), integration with vLLM and llama.cpp, and edge deployment.

Licensing & Compatibility

  • License: MIT
  • Compatibility: Permissive for commercial use and integration with closed-source applications.

Limitations & Caveats

VPTQ is intended for research and experimental purposes, requiring further validation for production use. The repository provides the quantization algorithm; performance of community-provided quantized models is not guaranteed. Current testing is limited to English text.

Health Check

Last commit: 3 months ago
Responsiveness: Inactive
Pull Requests (30d): 1
Issues (30d): 2

Star History

19 stars in the last 90 days

Explore Similar Projects

Starred by Lysandre Debut (Chief Open-Source Officer at Hugging Face), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 1 more.

AQLM by Vahe1994

PyTorch code for LLM compression via Additive Quantization (AQLM)

0.1% · 1k stars
created 1 year ago, updated 2 months ago

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Georgios Konstantopoulos (CTO, General Partner at Paradigm), and 2 more.

GPTQ-for-LLaMa by qwopqwop200

4-bit quantization for LLaMA models using GPTQ

0.0% · 3k stars
created 2 years ago, updated 1 year ago