Quantization framework for efficient LLM serving (ICML 2024 paper)
SqueezeLLM offers a post-training quantization framework for efficient Large Language Model (LLM) serving, targeting researchers and engineers seeking to reduce memory footprints without sacrificing performance. It introduces Dense-and-Sparse Quantization, a method that splits weights into a heavily quantized dense component and a sparse component preserving sensitive weights, enabling larger models to run within a smaller memory budget at comparable or improved accuracy.
How It Works
SqueezeLLM employs a Dense-and-Sparse Quantization strategy that partitions each LLM weight matrix into two parts: a dense component that tolerates aggressive low-bit quantization (e.g., 3-bit or 4-bit) with minimal accuracy loss, and a sparse component that keeps a small fraction of sensitive weights and outliers in full precision. This hybrid decomposition delivers substantial memory reduction while maintaining model fidelity, outperforming naive round-to-nearest quantization.
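To make the decomposition concrete, here is a minimal PyTorch sketch. It is an illustration under stated assumptions, not the repo's implementation: it uses magnitude-based outlier selection as a stand-in for the paper's sensitivity-based criterion, a uniform round-to-nearest quantizer in place of SqueezeLLM's non-uniform (weighted k-means) one, and an arbitrary outlier_frac budget.

import torch

def dense_and_sparse_split(W: torch.Tensor, outlier_frac: float = 0.005):
    # Split W into a sparse outlier part (kept in full precision) and a
    # dense remainder destined for low-bit quantization. Magnitude-based
    # selection is a simplification of the paper's sensitivity-based
    # criterion; outlier_frac is an illustrative budget, not the paper's.
    k = max(1, int(W.numel() * outlier_frac))
    threshold = W.abs().flatten().topk(k).values.min()
    mask = W.abs() >= threshold
    sparse = (W * mask).to_sparse()   # sensitive/outlier weights, full precision
    dense = W * (~mask)               # bulk of the weights, to be quantized
    return dense, sparse

def fake_quantize(W: torch.Tensor, bits: int = 3):
    # Uniform round-to-nearest stand-in for SqueezeLLM's non-uniform
    # quantizer; included only to complete the reconstruction path.
    scale = W.abs().max() / (2 ** (bits - 1) - 1)
    return torch.round(W / scale) * scale

W = torch.randn(1024, 1024)
dense, sparse = dense_and_sparse_split(W)
W_hat = fake_quantize(dense) + sparse.to_dense()
print("mean abs error:", (W - W_hat).abs().mean().item())

Because the sparse component captures exactly the weights that low-bit quantization handles worst, the dense remainder has a much tighter value range, which is what lets the aggressive 3-bit or 4-bit quantization stay accurate.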
Quick Start & Requirements
# Create and activate an isolated environment
conda create --name sqllm python=3.9 -y
conda activate sqllm

# Install SqueezeLLM from source
git clone https://github.com/SqueezeAILab/SqueezeLLM
cd SqueezeLLM
pip install -e .

# Build the custom CUDA kernels for low-bit inference
cd squeezellm
python setup_cuda.py install
Pre-quantized model checkpoints are provided as .pt files. Original model weights may be needed for LLaMA (v1) and Vicuna v1.1.
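As a quick sanity check, a downloaded checkpoint can be opened directly with torch.load. This is a minimal sketch: the filename is illustrative, and the assumption that the file holds a state dict of packed tensors is not guaranteed by the repo.

import torch

# Illustrative filename; released checkpoints encode model, bit width,
# and sparsity level. map_location="cpu" avoids requiring a GPU just
# to inspect the file.
state = torch.load("sq-llama-7b-w3-s0.pt", map_location="cpu")

# Assuming a state dict of packed/quantized tensors, list a few entries.
for name in list(state)[:8]:
    value = state[name]
    print(name, getattr(value, "shape", type(value)))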
Highlighted Details
Maintenance & Community
Last commit 11 months ago; development appears inactive.
Licensing & Compatibility
Limitations & Caveats
Certain features require extra command-line flags (e.g., --torch_profile, --eval), and older model versions (LLaMA v1, Vicuna v1.1) may require the original model weights.