LLM quantization for extreme compression
QuIP# is a weight-only post-training quantization method designed for extreme compression of Large Language Models (LLMs) down to 4 bits per weight or less. It targets researchers and practitioners seeking to deploy LLMs with significantly reduced memory footprints and improved inference speeds, offering state-of-the-art performance in highly compressed regimes.
How It Works
QuIP# combines a randomized Hadamard transform (RHT) for efficient incoherence processing, $E_8$ lattice-based codebooks for fast vector quantization, and a fine-tuning scheme that captures inter-layer dependencies. Together these yield superior quantization quality at very low bitrates; its 3-bit models are reported to scale better than theoretically lossless 4-bit models.
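As an illustration only (not the repository's implementation), the sketch below shows the two core operations in NumPy: a two-sided randomized Hadamard transform that incoherence-processes a weight matrix, and nearest-point rounding to the $E_8$ lattice, whose points underlie the codebooks. The dimensions, seed, and helper names are assumptions made for the example.

```python
# Minimal sketch (not the repository's implementation) of QuIP#'s two core ideas:
# (1) randomized Hadamard transform (RHT) incoherence processing of a weight matrix,
# (2) nearest-point rounding to the E8 lattice used for vector quantization.
import numpy as np
from scipy.linalg import hadamard  # requires power-of-two dimensions

def rht(W, seed=0):
    """Two-sided RHT: W_tilde = (H_m S_m) W (S_n H_n), with random sign matrices S."""
    m, n = W.shape
    rng = np.random.default_rng(seed)
    s_m = rng.choice([-1.0, 1.0], size=m)
    s_n = rng.choice([-1.0, 1.0], size=n)
    H_m = hadamard(m) / np.sqrt(m)   # orthonormal Hadamard matrices
    H_n = hadamard(n) / np.sqrt(n)
    W_tilde = (H_m * s_m) @ W @ (s_n[:, None] * H_n)
    return W_tilde, (s_m, s_n)       # the sign vectors are needed to invert the transform

def nearest_d8(x):
    """Nearest point in D8 (integer vectors with even coordinate sum)."""
    r = np.rint(x)
    if int(r.sum()) % 2 != 0:
        # fix parity by re-rounding the coordinate with the largest rounding error
        i = np.argmax(np.abs(x - r))
        r[i] += 1.0 if x[i] >= r[i] else -1.0
    return r

def nearest_e8(x):
    """Nearest point in E8 = D8 ∪ (D8 + 1/2), per Conway & Sloane."""
    c0 = nearest_d8(x)
    c1 = nearest_d8(x - 0.5) + 0.5
    return c0 if np.sum((x - c0) ** 2) <= np.sum((x - c1) ** 2) else c1

# Example: incoherence-process a random weight matrix, then quantize one
# 8-dimensional group of its entries to the E8 lattice.
W = np.random.randn(128, 256)
W_tilde, signs = rht(W)
print(nearest_e8(W_tilde[0, :8]))
```

In the method itself, groups of eight incoherence-processed weights are mapped to a fixed, hardware-friendly codebook built from $E_8$ lattice points rather than searched against the unconstrained lattice as above.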
Quick Start & Requirements
Install dependencies with pip install -r requirements.txt and build the CUDA inference kernels (cd quiptools && python setup.py install).
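The quiptools kernels accelerate the inference-time decode: each stored index is expanded through the codebook and the randomized Hadamard transform is undone before (or fused with) the matrix-vector product. A rough NumPy sketch of that decode path, with a toy codebook and illustrative shapes rather than the real E8P layout:

```python
# Conceptual sketch of the inference-time decode the CUDA kernels accelerate:
# codebook lookup per 8-weight group, then inversion of the (orthogonal) RHT.
# The toy codebook and shapes are illustrative, not the real E8P format.
import numpy as np
from scipy.linalg import hadamard

def dequantize(indices, codebook, scale, s_m, s_n):
    """indices: (m, n//8) codebook ids; codebook: (K, 8) lattice points;
    s_m, s_n: the random sign vectors used during quantization."""
    m, groups = indices.shape
    n = groups * 8
    W_tilde = codebook[indices].reshape(m, n) * scale      # table lookup + rescale
    H_m = hadamard(m) / np.sqrt(m)
    H_n = hadamard(n) / np.sqrt(n)
    # (H_m S_m) and (S_n H_n) are orthogonal, so their transposes invert the RHT
    return (H_m * s_m).T @ W_tilde @ (s_n[:, None] * H_n).T
```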
Highlighted Details
Maintenance & Community
This codebase is no longer under active development; QTIP is its successor. Questions can still be raised as GitHub issues. Pre-quantized models are available on Hugging Face.
Licensing & Compatibility
The code is licensed under GNU GPL v3, whose copyleft terms apply to derivative works. Use of the underlying LLMs (Llama, Mistral) is governed by their respective model licenses.
Limitations & Caveats
The project is not under active development. Optimized CUDA kernels for 1-bit matrix-vector multiplication are missing, which slows 3-bit inference. Adapting the pipeline to non-Llama architectures requires manually modifying the provided scripts.