SqueezeLLM by SqueezeAILab

Quantization framework for efficient LLM serving (ICML 2024 paper)

created 2 years ago
698 stars

Top 49.8% on sourcepulse

Project Summary

SqueezeLLM offers a post-training quantization framework for efficient Large Language Model (LLM) serving, targeting researchers and engineers seeking to reduce memory footprints without sacrificing performance. It introduces Dense-and-Sparse Quantization, a novel method that splits weights into a heavily quantized dense component and a sparse component preserving sensitive weights, enabling larger models to run on less memory with comparable or improved accuracy.

How It Works

SqueezeLLM employs a Dense-and-Sparse Quantization strategy. This approach partitions each LLM weight matrix into two parts: a dense component that is aggressively quantized (e.g., to 3 or 4 bits) using non-uniform, sensitivity-based codebooks, and a small sparse component that keeps outliers and other sensitive weights in full precision. This hybrid decomposition yields significant memory reduction while maintaining model fidelity, outperforming naive round-to-nearest quantization at the same bit-width.
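
Below is a minimal, self-contained sketch of the dense-and-sparse split, for intuition only. It substitutes magnitude-based outlier selection and plain (unweighted) 1-D k-means for the repository's Hessian/sensitivity-based selection and weighted clustering, so the function names and heuristics here are illustrative assumptions, not SqueezeLLM's implementation.

    # Illustrative sketch of a dense-and-sparse weight decomposition (NOT the
    # repository's implementation): outliers are picked by magnitude and the
    # dense remainder is quantized to 2**bits centroids with unweighted k-means;
    # SqueezeLLM instead uses Hessian-based sensitivity for both steps.
    import numpy as np

    def dense_and_sparse_quantize(W, bits=3, sparse_frac=0.0045, iters=20):
        flat = W.ravel()
        k = 2 ** bits

        # Sparse component: keep the largest-magnitude weights exactly (FP16).
        n_sparse = max(1, int(sparse_frac * flat.size))
        sparse_idx = np.argpartition(np.abs(flat), -n_sparse)[-n_sparse:]
        sparse_vals = flat[sparse_idx].astype(np.float16)

        # Dense component: remaining weights mapped to k non-uniform centroids.
        dense = flat.copy()
        dense[sparse_idx] = 0.0  # outliers live only in the sparse part
        centroids = np.quantile(dense, np.linspace(0.0, 1.0, k))  # data-driven init
        for _ in range(iters):  # plain 1-D k-means
            assign = np.abs(dense[:, None] - centroids[None, :]).argmin(axis=1)
            for c in range(k):
                members = dense[assign == c]
                if members.size:
                    centroids[c] = members.mean()
        codes = np.abs(dense[:, None] - centroids[None, :]).argmin(axis=1).astype(np.uint8)
        return codes.reshape(W.shape), centroids, sparse_idx, sparse_vals

    def reconstruct(codes, centroids, sparse_idx, sparse_vals):
        W_hat = centroids[codes].ravel()
        W_hat[sparse_idx] = sparse_vals.astype(W_hat.dtype)  # restore exact outliers
        return W_hat.reshape(codes.shape)

    # Toy usage: heavy-tailed fake weights with a handful of large outliers.
    rng = np.random.default_rng(0)
    W = (rng.standard_t(df=3, size=(256, 256)) * 0.02).astype(np.float32)
    codes, centroids, idx, vals = dense_and_sparse_quantize(W, bits=3)
    print("mean abs error:", np.abs(reconstruct(codes, centroids, idx, vals) - W).mean())

Keeping only a fraction of a percent of the weights in the sparse component is what lets the dense part use very low bit-widths without the outliers dominating the quantization error.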

Quick Start & Requirements

  • Install (a build sanity check follows this list):
    conda create --name sqllm python=3.9 -y
    conda activate sqllm
    git clone https://github.com/SqueezeAILab/SqueezeLLM
    cd SqueezeLLM
    pip install -e .
    cd squeezellm
    python setup_cuda.py install
    
  • Prerequisites: Python 3.9, CUDA 11.3, cuDNN 8.2. Tested on A5000/A6000 GPUs.
  • Resources: Requires downloading pre-quantized model checkpoints (e.g., .pt files). Original models may be needed for LLaMA (v1) and Vicuna v1.1.
  • Docs: Paper, Custom Model Quantization
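
If the setup_cuda.py step above succeeds, the compiled kernel extension should be importable. The snippet below is a minimal sanity check; the module name quant_cuda is an assumption carried over from the GPTQ-For-LLaMA code this repository reuses, so adjust it if the build registers the kernels under a different name.

    # Post-install sanity check. `quant_cuda` is an assumed extension name
    # (inherited from GPTQ-For-LLaMA); adjust if setup_cuda.py uses another.
    import torch

    try:
        import quant_cuda  # noqa: F401  (built by `python setup_cuda.py install`)
        print("quant_cuda kernels found; CUDA available:", torch.cuda.is_available())
    except ImportError as err:
        print("CUDA kernels not importable; re-run setup_cuda.py in squeezellm/:", err)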

Highlighted Details

  • Supports 3-bit and 4-bit quantization with sparsity levels of 0%, 0.05%, and 0.45% (a rough bits-per-weight estimate follows this list).
  • Achieves 2% higher MMLU on Vicuna models with a 2x smaller memory footprint than FP16.
  • Integrated into the official vLLM framework.
  • Supports LLaMA (v1, v2), Vicuna (v1.1, v1.3), Mistral, XGen (8k seq length), and OPT models.
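
To put the sparsity levels in perspective, the sketch below estimates the effective bits per weight when a b-bit dense component is paired with a small FP16 sparse component. The storage model (a 16-bit value plus a 32-bit index per kept weight) is an assumption for illustration, not the repository's actual packing format.

    # Rough effective bits/weight for a b-bit dense component plus sparse outliers.
    # Assumed cost per kept weight: 16-bit FP16 value + 32-bit index (illustrative
    # only; SqueezeLLM's real packing may differ).
    def effective_bits(dense_bits, sparse_frac, value_bits=16, index_bits=32):
        return dense_bits + sparse_frac * (value_bits + index_bits)

    for b in (3, 4):
        for s in (0.0, 0.0005, 0.0045):  # the 0%, 0.05%, and 0.45% sparsity levels
            print(f"{b}-bit dense, {s:.2%} sparse -> {effective_bits(b, s):.2f} bits/weight")

Even at the 0.45% level, the sparse component adds only about 0.2 bits per weight under this model, so the overall footprint stays close to the dense bit-width.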

Maintenance & Community

  • Recent updates added Mistral support and custom model quantization, though commit activity has since slowed (see Health Check below).
  • Code reuses components from GPTQ and GPTQ-For-LLaMA.
  • Citation: Kim et al., "SqueezeLLM: Dense-and-Sparse Quantization" (ICML 2024).

Licensing & Compatibility

  • The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

  • Quantization for Vicuna-30B-v1.3 is listed as "Coming Soon."
  • Reproducing paper results requires specific flags (--torch_profile, --eval) and, for older model versions (e.g., LLaMA v1, Vicuna v1.1), the original model weights.

Health Check

  • Last commit: 11 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 11 stars in the last 90 days

Explore Similar Projects

Starred by Lysandre Debut (Chief Open-Source Officer at Hugging Face), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 1 more.

AQLM by Vahe1994

PyTorch code for LLM compression via Additive Quantization (AQLM)
1k stars · created 1 year ago · updated 2 months ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Georgios Konstantopoulos (CTO, General Partner at Paradigm), and 2 more.

GPTQ-for-LLaMa by qwopqwop200

4-bit quantization for LLaMA models using GPTQ
3k stars · created 2 years ago · updated 1 year ago