AQLM by Vahe1994

PyTorch code for LLM compression via Additive Quantization (AQLM)

created 1 year ago
1,278 stars

Top 31.8% on sourcepulse

Project Summary

This repository provides the official PyTorch implementation of AQLM (Additive Quantization of Language Models) and PV-Tuning, two techniques for extreme compression of Large Language Models (LLMs). They enable significant reductions in model size and memory footprint while maintaining high accuracy, targeting researchers and practitioners who need to deploy LLMs efficiently.

How It Works

AQLM achieves extreme compression by quantizing LLM weights using additive quantization, which decomposes weights into sums of vectors from learned codebooks. PV-Tuning further enhances this by introducing a novel finetuning algorithm that improves accuracy over traditional methods like Straight-Through Estimation. This approach allows for highly compressed models with minimal performance degradation.
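
As a rough illustration of the idea (not the repository's actual kernels), the sketch below reconstructs one row of additively quantized weights: every group of weights is stored as one index per codebook, and dequantization sums the selected codebook vectors. All shapes and values here are hypothetical.

    import torch

    # Hypothetical configuration: 2 codebooks, 8-bit codes, groups of 8 weights.
    num_codebooks, codebook_bits, group_size = 2, 8, 8
    num_groups = 4

    # Learned codebooks: (num_codebooks, 2**codebook_bits, group_size).
    codebooks = torch.randn(num_codebooks, 2 ** codebook_bits, group_size)
    # Stored codes: one index per codebook for each weight group.
    codes = torch.randint(0, 2 ** codebook_bits, (num_groups, num_codebooks))

    # Dequantize: each group is the sum of its selected codebook vectors.
    groups = codebooks[torch.arange(num_codebooks), codes].sum(dim=1)  # (num_groups, group_size)
    weight_row = groups.reshape(-1)  # concatenate groups back into a weight row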

Quick Start & Requirements

  • Install the inference library: pip install aqlm[gpu,cpu]>=1.1.6
  • Requires PyTorch and Hugging Face Transformers.
  • GPU with CUDA is recommended for faster inference and quantization.
  • Quantization process can be resource-intensive, with a 7B model taking ~1 day on a single A100 GPU.
  • See the Colab examples for quick inference demos, or the minimal loading sketch after this list.
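
A minimal inference sketch, assuming the aqlm package is installed and Transformers plus accelerate are available; the checkpoint id below is illustrative, so substitute any AQLM-quantized model from the Hugging Face Hub:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "ISTA-DASLab/Llama-2-7b-AQLM-2Bit-1x16-hf"  # illustrative checkpoint id

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype="auto",   # let Transformers pick the stored dtype
        device_map="auto",    # GPU if available, otherwise CPU (requires accelerate)
    )

    inputs = tokenizer("Additive quantization compresses LLMs by", return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))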

Highlighted Details

  • Supports Llama, Mistral, and Mixtral model families.
  • Achieves competitive perplexity scores, e.g., Llama-2-7b at ~1 bit achieves WikiText 2 PPL 7.85.
  • Offers various quantization schemes (e.g., 1x16, 2x8, 8x8) with different accuracy/speed trade-offs; the naming is unpacked in the sketch after this list.
  • Includes kernels optimized for both GPU (Triton, CUDA) and CPU (Numba).
  • PV-Tuning is accepted for oral presentation at NeurIPS'2024.
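
For intuition, a "KxB" scheme name denotes K codebooks of B bits each applied per weight group. The sketch below works out the approximate bit-width, assuming the paper's default group size of 8 and ignoring codebook and scale overhead; actual configurations (especially CPU-oriented ones) may use different group sizes.

    # Approximate bits per weight for a "KxB" scheme, ignoring codebook and scale
    # storage; group_size=8 is an assumption taken from the paper's default setup.
    def bits_per_weight(num_codebooks: int, codebook_bits: int, group_size: int = 8) -> float:
        return num_codebooks * codebook_bits / group_size

    print(bits_per_weight(1, 16))  # 1x16 -> 2.0 bits/weight
    print(bits_per_weight(2, 8))   # 2x8  -> 2.0 bits/weight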

Maintenance & Community

  • Active development with recent updates including PV-Tuning integration and improved 1-bit model accuracy.
  • Papers accepted to ICML'2024 and NeurIPS'2024.
  • Contribution guidelines are provided; code must be formatted with black and isort.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README. Users should verify licensing for both the code and any pre-quantized models.

Limitations & Caveats

  • Quantization process is time-consuming, especially for larger models.
  • Tokenized datasets are model-family specific due to tokenizer differences.
  • Some attention implementations (e.g., SDPA) may cause issues; the eager implementation is recommended (see the snippet after this list).
  • Reproducing older finetuning results requires using a specific commit (559a366).
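
To opt into eager attention when loading a quantized checkpoint, the standard Transformers flag can be used (the checkpoint id is illustrative):

    from transformers import AutoModelForCausalLM

    # attn_implementation="eager" avoids the SDPA path flagged in the caveat above.
    model = AutoModelForCausalLM.from_pretrained(
        "ISTA-DASLab/Llama-2-7b-AQLM-2Bit-1x16-hf",  # illustrative checkpoint id
        attn_implementation="eager",
        torch_dtype="auto",
        device_map="auto",
    )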

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 1
  • Issues (30d): 1

Star History

  • 28 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Georgios Konstantopoulos (CTO, General Partner at Paradigm), and 2 more.

GPTQ-for-LLaMa by qwopqwop200

4-bit quantization for LLaMA models using GPTQ
3k stars · created 2 years ago · updated 1 year ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 6 more.

AutoGPTQ by AutoGPTQ

LLM quantization package using GPTQ algorithm
5k stars · created 2 years ago · updated 3 months ago