This repository provides the official PyTorch implementation of AQLM (Additive Quantization of Language Models) and PV-Tuning, techniques for extreme compression of Large Language Models (LLMs). They enable significant reductions in model size and memory footprint while maintaining high accuracy, targeting researchers and practitioners who need to deploy LLMs efficiently.
How It Works
AQLM achieves extreme compression by quantizing LLM weights using additive quantization, which decomposes weights into sums of vectors from learned codebooks. PV-Tuning further enhances this by introducing a novel finetuning algorithm that improves accuracy over traditional methods like Straight-Through Estimation. This approach allows for highly compressed models with minimal performance degradation.
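To make the decomposition concrete, here is a minimal sketch of how an additively quantized weight group is decoded; the sizes and variable names are illustrative assumptions, not the repository's actual data layout or kernels.

```python
import torch

# Toy additive-quantization decode: each group of 8 weights is encoded by
# 2 codebooks with 2**8 entries each (a "2x8"-style setup, chosen for illustration).
group_size = 8
num_codebooks = 2
codebook_bits = 8

codebooks = torch.randn(num_codebooks, 2 ** codebook_bits, group_size)  # learned codebooks
codes = torch.randint(0, 2 ** codebook_bits, (num_codebooks,))          # learned per-group indices

# Dequantization: the weight group is the sum of one selected vector per codebook.
weight_group = sum(codebooks[c, codes[c]] for c in range(num_codebooks))
print(weight_group.shape)  # torch.Size([8])
```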
Quick Start & Requirements
- Install the inference library: `pip install "aqlm[gpu,cpu]>=1.1.6"` (the quotes stop the shell from interpreting the brackets and version specifier).
- Requires PyTorch and Hugging Face Transformers.
- A CUDA-capable GPU is recommended for faster inference and quantization.
- The quantization process is resource-intensive: quantizing a 7B model takes roughly one day on a single A100 GPU.
- See the Colab examples for quick inference demos, or the minimal loading sketch below.
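A minimal loading sketch, assuming `aqlm`, a recent `transformers`, and `accelerate` (for `device_map="auto"`) are installed; the model ID is an example of a pre-quantized checkpoint and should be checked against the repository's model zoo.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example pre-quantized checkpoint; verify the ID against the repo's model list.
model_id = "ISTA-DASLab/Llama-2-7b-AQLM-2Bit-1x16-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # keep the checkpoint's dtype
    device_map="auto",    # place layers on a GPU if one is available
)

inputs = tokenizer("Quantized models can", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```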
Highlighted Details
- Supports Llama, Mistral, and Mixtral model families.
- Achieves competitive perplexity, e.g., Llama-2-7b quantized to ~1 bit reaches a WikiText-2 PPL of 7.85.
- Offers several quantization schemes (e.g., 1x16, 2x8, 8x8) with different accuracy/speed trade-offs; see the bits-per-weight sketch after this list.
- Includes kernels optimized for both GPU (Triton, CUDA) and CPU (Numba).
- PV-Tuning was accepted for an oral presentation at NeurIPS'2024.
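As a rough way to read the scheme names, "NxB" denotes N codebooks of B bits applied to each group of weights. The sketch below assumes the common group size of 8 and ignores codebook and scale overhead (both are assumptions, and actual configurations vary):

```python
# Back-of-the-envelope bits per weight for an "NxB" scheme, assuming a group
# size of 8 and ignoring codebook/scale overhead (both assumptions).
def approx_bits_per_weight(num_codebooks: int, codebook_bits: int, group_size: int = 8) -> float:
    return num_codebooks * codebook_bits / group_size

print(approx_bits_per_weight(1, 16))  # 1x16 -> 2.0 bits per weight
print(approx_bits_per_weight(2, 8))   # 2x8  -> 2.0 bits per weight
```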
Maintenance & Community
- Active development with recent updates including PV-Tuning integration and improved 1-bit model accuracy.
- The AQLM paper was accepted to ICML'2024 and the PV-Tuning paper to NeurIPS'2024.
- Contribution guidelines are provided; code must be formatted with black and isort.
Licensing & Compatibility
- The repository does not explicitly state a license in the README. Users should verify licensing for both the code and any pre-quantized models.
Limitations & Caveats
- The quantization process is time-consuming, especially for larger models.
- Tokenized datasets are model-family specific due to tokenizer differences.
- Some attention implementations (e.g., SDPA) may cause issues; the eager implementation is recommended (see the loading sketch below).
- Reproducing older finetuning results requires using a specific commit (559a366).
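For the attention caveat above, a hedged sketch of requesting the eager attention path at load time (the model ID is an illustrative placeholder):

```python
from transformers import AutoModelForCausalLM

# Request the eager attention path to sidestep potential SDPA-related issues.
# The model ID is illustrative; substitute the checkpoint you actually use.
model = AutoModelForCausalLM.from_pretrained(
    "ISTA-DASLab/Llama-2-7b-AQLM-2Bit-1x16-hf",
    attn_implementation="eager",
    torch_dtype="auto",
    device_map="auto",
)
```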