Weight compression research paper for near-lossless LLM quantization
SpQR offers near-lossless compression for Large Language Models (LLMs) by employing a novel sparse-quantized representation. This repository is for researchers and engineers seeking to reduce LLM memory footprint and potentially accelerate inference, targeting models like LLaMA, Falcon, and OPT.
How It Works
SpQR achieves compression through a two-stage process: it first identifies and isolates outlier weights, storing them in a higher-precision sparse format, and then quantizes the remaining weights to a very low bit-width. Treating outliers separately preserves model accuracy and yields near-lossless compression. The method is detailed in the paper "SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression."
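The sketch below illustrates this split in PyTorch: the largest-magnitude weights are kept exactly in a sparse tensor, and the rest are quantized per row to a low bit-width. It is a minimal illustration of the idea, not the repository's implementation; the function names, the fixed outlier fraction, and the simple per-row min-max quantization are assumptions (SpQR itself uses grouped, bilevel quantization statistics as described in the paper).

```python
import torch

def spqr_like_compress(weight: torch.Tensor, bits: int = 3, outlier_frac: float = 0.01):
    """Illustrative split of a 2D [out_features, in_features] weight matrix into
    sparse high-precision outliers plus a low-bit quantized dense remainder."""
    # Keep the largest-magnitude weights exactly, in a sparse tensor.
    k = max(1, int(outlier_frac * weight.numel()))
    threshold = weight.abs().flatten().kthvalue(weight.numel() - k).values
    outlier_mask = weight.abs() > threshold
    outliers = (weight * outlier_mask).to_sparse()

    # Quantize the remaining (non-outlier) weights per output row (min-max, asymmetric).
    base = weight.masked_fill(outlier_mask, 0.0)
    qmax = 2 ** bits - 1
    w_min = base.min(dim=1, keepdim=True).values
    w_max = base.max(dim=1, keepdim=True).values
    scale = (w_max - w_min).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round((base - w_min) / scale), 0, qmax).to(torch.uint8)
    return q, scale, w_min, outliers

def spqr_like_decompress(q, scale, w_min, outliers):
    """Reconstruct an approximate dense weight matrix from the compressed parts."""
    dense = q.to(scale.dtype) * scale + w_min
    outl = outliers.to_dense()
    # Outlier positions are restored exactly; all other entries carry quantization error.
    return dense.masked_fill(outl != 0, 0.0) + outl
```

Calling `spqr_like_decompress` on the outputs of `spqr_like_compress` reconstructs an approximation of the original matrix in which the isolated outliers are exact and only the remaining weights carry quantization error.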
Quick Start & Requirements
Install dependencies with:

```
pip install -r requirements.txt
```

The code requires torch>=2.0.0 with CUDA support. Transformers version 4.28.dev0 (commit 464d420775) is recommended for reproducibility.
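Before running quantization, it can help to confirm the environment matches these requirements. The snippet below is a simple check using only standard torch and transformers attributes.

```python
import torch
import transformers

# Quick environment check against the recommended setup.
print("torch:", torch.__version__)                    # expected: >= 2.0.0
print("CUDA available:", torch.cuda.is_available())   # CUDA support is required
print("transformers:", transformers.__version__)      # 4.28.dev0 recommended
```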
Maintenance & Community
No specific community channels or active maintainer information is provided in the README.
Licensing & Compatibility
The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The README states that efficient inference code will be added soon, so it is not yet fully implemented. The current evaluation script only supports LLaMA/Falcon quantization. Setup and benchmarking require significant GPU and system RAM resources.