This repository provides the official PyTorch implementation of AQLM (Additive Quantization of Language Models) and PV-Tuning, techniques for extreme compression of Large Language Models (LLMs). They enable significant reductions in model size and memory footprint while maintaining high accuracy, targeting researchers and practitioners who need to deploy LLMs efficiently.
How It Works
AQLM achieves extreme compression by quantizing LLM weights using additive quantization, which decomposes weights into sums of vectors from learned codebooks. PV-Tuning further enhances this by introducing a novel finetuning algorithm that improves accuracy over traditional methods like Straight-Through Estimation. This approach allows for highly compressed models with minimal performance degradation.
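To make the decomposition concrete, here is a minimal sketch of how an additively quantized weight group is decoded; the sizes and variable names are illustrative assumptions, not the repository's actual data layout or kernels.

```python
import torch

# Toy additive-quantization decode: each group of 8 weights is encoded by
# 2 codebooks with 2**8 entries each (a "2x8"-style setup, chosen for illustration).
group_size = 8
num_codebooks = 2
codebook_bits = 8

codebooks = torch.randn(num_codebooks, 2 ** codebook_bits, group_size)  # learned codebooks
codes = torch.randint(0, 2 ** codebook_bits, (num_codebooks,))          # learned per-group indices

# Dequantization: the weight group is the sum of one selected vector per codebook.
weight_group = sum(codebooks[c, codes[c]] for c in range(num_codebooks))
print(weight_group.shape)  # torch.Size([8])
```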
Quick Start & Requirements
- Install the inference library: `pip install "aqlm[gpu,cpu]>=1.1.6"` (the quotes stop the shell from interpreting the brackets and version specifier).
- Requires PyTorch and Hugging Face Transformers.
- A CUDA-capable GPU is recommended for faster inference and quantization.
- The quantization process is resource-intensive: quantizing a 7B model takes roughly one day on a single A100 GPU.
- See the Colab examples for quick inference demos, or the minimal loading sketch below.
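A minimal loading sketch, assuming `aqlm`, a recent `transformers`, and `accelerate` (for `device_map="auto"`) are installed; the model ID is an example of a pre-quantized checkpoint and should be checked against the repository's model zoo.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example pre-quantized checkpoint; verify the ID against the repo's model list.
model_id = "ISTA-DASLab/Llama-2-7b-AQLM-2Bit-1x16-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # keep the checkpoint's dtype
    device_map="auto",    # place layers on a GPU if one is available
)

inputs = tokenizer("Quantized models can", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```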
Highlighted Details
- Supports Llama, Mistral, and Mixtral model families.
- Achieves competitive perplexity, e.g., Llama-2-7b quantized to ~1 bit reaches a WikiText-2 PPL of 7.85.
- Offers several quantization schemes (e.g., 1x16, 2x8, 8x8) with different accuracy/speed trade-offs; see the bits-per-weight sketch after this list.
- Includes kernels optimized for both GPU (Triton, CUDA) and CPU (Numba).
- PV-Tuning was accepted for an oral presentation at NeurIPS'2024.
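As a rough way to read the scheme names, "NxB" denotes N codebooks of B bits applied to each group of weights. The sketch below assumes the common group size of 8 and ignores codebook and scale overhead (both are assumptions, and actual configurations vary):

```python
# Back-of-the-envelope bits per weight for an "NxB" scheme, assuming a group
# size of 8 and ignoring codebook/scale overhead (both assumptions).
def approx_bits_per_weight(num_codebooks: int, codebook_bits: int, group_size: int = 8) -> float:
    return num_codebooks * codebook_bits / group_size

print(approx_bits_per_weight(1, 16))  # 1x16 -> 2.0 bits per weight
print(approx_bits_per_weight(2, 8))   # 2x8  -> 2.0 bits per weight
```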
Maintenance & Community
- Active development with recent updates including PV-Tuning integration and improved 1-bit model accuracy.
- The AQLM paper was accepted to ICML'2024 and the PV-Tuning paper to NeurIPS'2024.
- Contribution guidelines are provided; code must be formatted with black and isort.
Licensing & Compatibility
- The repository does not explicitly state a license in the README. Users should verify licensing for both the code and any pre-quantized models.
Limitations & Caveats
- The quantization process is time-consuming, especially for larger models.
- Tokenized datasets are model-family specific due to tokenizer differences.
- Some attention implementations (e.g., SDPA) may cause issues; the eager implementation is recommended (see the loading sketch below).
- Reproducing older finetuning results requires using a specific commit (559a366).
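For the attention caveat above, a hedged sketch of requesting the eager attention path at load time (the model ID is an illustrative placeholder):

```python
from transformers import AutoModelForCausalLM

# Request the eager attention path to sidestep potential SDPA-related issues.
# The model ID is illustrative; substitute the checkpoint you actually use.
model = AutoModelForCausalLM.from_pretrained(
    "ISTA-DASLab/Llama-2-7b-AQLM-2Bit-1x16-hf",
    attn_implementation="eager",
    torch_dtype="auto",
    device_map="auto",
)
```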