SqueezeLLM by SqueezeAILab

Quantization framework for efficient LLM serving (ICML 2024 paper)

Created 2 years ago
703 stars

Top 48.5% on SourcePulse

Project Summary

SqueezeLLM offers a post-training quantization framework for efficient Large Language Model (LLM) serving, targeting researchers and engineers who need to shrink memory footprints without sacrificing accuracy. It introduces Dense-and-Sparse Quantization, a method that splits the weights into a heavily quantized dense component and a small sparse component that preserves sensitive weights at full precision, enabling larger models to run in less memory with comparable or improved accuracy.

How It Works

SqueezeLLM employs a Dense-and-Sparse Quantization strategy: each weight matrix is partitioned into a dense component that tolerates aggressive low-bit quantization (3-bit or 4-bit) with minimal accuracy loss, and a small sparse component that retains critical outlier weights at full precision. This hybrid split delivers large memory reductions while maintaining model fidelity, outperforming naive uniform quantization.
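
For intuition, here is a minimal PyTorch sketch of the dense-and-sparse decomposition. It uses a simple magnitude-based outlier criterion and uniform quantization of the dense part; the actual SqueezeLLM method selects sensitive weights with a sensitivity metric rather than magnitude and quantizes the dense part with non-uniform k-means centroids, so treat this as an illustration of the split, not the repo's implementation:

    import torch

    def dense_and_sparse_decompose(W: torch.Tensor, sparsity: float = 0.0045, bits: int = 4):
        """Split W into a low-bit dense part plus a full-precision sparse part."""
        # Treat the largest-magnitude fraction of weights as "sensitive" outliers
        # (a simplification; SqueezeLLM uses sensitivity, not magnitude).
        k = max(1, int(W.numel() * sparsity))
        threshold = W.abs().flatten().topk(k).values.min()
        outlier_mask = W.abs() >= threshold

        # Sparse component: outliers kept at full precision, stored sparsely.
        sparse = (W * outlier_mask).to_sparse()

        # Dense component: remaining weights quantized to `bits` bits
        # (uniform here; SqueezeLLM uses non-uniform k-means centroids).
        dense = W * (~outlier_mask)
        scale = dense.abs().max() / (2 ** (bits - 1) - 1)
        q = torch.clamp((dense / scale).round(), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
        return q.to(torch.int8), scale, sparse

    def reconstruct(q, scale, sparse):
        # Dequantize the dense part and add back the sparse outliers.
        return q.to(torch.float32) * scale + sparse.to_dense()

At 4 bits with 0.45% sparsity, the dense part shares only 16 quantization levels while roughly 1 in 220 weights pays full-precision cost, which is why the overall savings stay close to the raw 4-bit ratio.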

Quick Start & Requirements

  • Install (a quick check that the CUDA build succeeded follows this list):
    conda create --name sqllm python=3.9 -y
    conda activate sqllm
    git clone https://github.com/SqueezeAILab/SqueezeLLM
    cd SqueezeLLM
    pip install -e .
    cd squeezellm
    python setup_cuda.py install
    
  • Prerequisites: Python 3.9, CUDA 11.3, cuDNN 8.2. Tested on A5000/A6000 GPUs.
  • Resources: Requires downloading pre-quantized model checkpoints (e.g., .pt files). Original models may be needed for LLaMA (v1) and Vicuna v1.1.
  • Docs: Paper, Custom Model Quantization
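
After the steps above, the snippet below is one way to confirm that the custom CUDA extension built and imports cleanly. The module name quant_cuda is an assumption based on the repo's GPTQ-for-LLaMA lineage; adjust it if the build output shows a different name:

    import torch       # import torch first so the shared CUDA libraries resolve
    import quant_cuda  # assumed extension name; ImportError means the build failed
    print("SqueezeLLM CUDA kernel loaded")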

Highlighted Details

  • Supports 3-bit and 4-bit quantization with sparsity levels of 0%, 0.05%, and 0.45%.
  • Achieves 2% higher MMLU on Vicuna models with a 2x smaller memory footprint than FP16 (a back-of-envelope weight-memory estimate follows this list).
  • Integrated into the official vLLM framework.
  • Supports LLaMA (v1, v2), Vicuna (v1.1, v1.3), Mistral, XGen (8k seq length), and OPT models.
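
For intuition on the footprint claims above, the arithmetic below estimates raw weight storage for a hypothetical 7B-parameter model at the supported bit-widths. It counts weight bits only, ignoring activations, the KV cache, and the exact sparse storage format, so realized end-to-end footprints will differ:

    # Illustrative weight-storage arithmetic for a hypothetical 7B-parameter model.
    params = 7e9

    fp16_gb = params * 16 / 8 / 1e9   # ~14.0 GB at 16 bits per weight
    w4_gb   = params * 4  / 8 / 1e9   # ~3.5 GB at 4 bits per weight
    w3_gb   = params * 3  / 8 / 1e9   # ~2.6 GB at 3 bits per weight

    # Sparse outliers (0.45% level) kept in FP16; assume ~48 bits per stored
    # entry (value plus indices) as a rough coordinate-format overhead.
    sparse_gb = params * 0.0045 * 48 / 8 / 1e9   # ~0.19 GB

    print(f"FP16 baseline:  {fp16_gb:.1f} GB")
    print(f"4-bit + sparse: {w4_gb + sparse_gb:.1f} GB")
    print(f"3-bit + sparse: {w3_gb + sparse_gb:.1f} GB")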

Maintenance & Community

  • README updates include Mistral support and custom model quantization, although commit activity has since tapered off (see Health Check below).
  • Code reuses components from GPTQ and GPTQ-for-LLaMa.
  • Citation: Kim et al., "SqueezeLLM: Dense-and-Sparse Quantization" (ICML 2024).

Licensing & Compatibility

  • The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

  • Quantization for Vicuna-30B-v1.3 is listed as "Coming Soon."
  • Reproducing paper results requires specific flags (--torch_profile, --eval) and potentially the original model weights for older versions; hedged example invocations are sketched below.
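
For illustration, invocations of the following shape are consistent with the repo's GPTQ-for-LLaMA lineage; the script name (llama.py), the dataset argument (c4), and the checkpoint filename are assumptions here, so defer to the repo README for the exact commands:

    # Perplexity evaluation (model/checkpoint paths are placeholders):
    CUDA_VISIBLE_DEVICES=0 python llama.py <model_path> c4 --wbits 4 \
        --load sq-llama-7b-w4-s0.pt --eval
    # Latency benchmarking with profiling enabled:
    CUDA_VISIBLE_DEVICES=0 python llama.py <model_path> c4 --wbits 4 \
        --load sq-llama-7b-w4-s0.pt --benchmark 128 --check --torch_profile
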
Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 2 stars in the last 30 days

Explore Similar Projects

Starred by Yaowei Zheng (Author of LLaMA-Factory), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 6 more.

gptq by IST-DASLab

  0.1% · 2k stars
  Code for GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers
  Created 2 years ago · Updated 1 year ago
Starred by Lysandre Debut (Chief Open-Source Officer at Hugging Face), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 4 more.

AQLM by Vahe1994

  0.4% · 1k stars
  PyTorch code for LLM compression via Additive Quantization (AQLM)
  Created 1 year ago · Updated 1 month ago
Starred by Yaowei Zheng (Author of LLaMA-Factory), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 7 more.

llm-awq by mit-han-lab

  0.3% · 3k stars
  Weight quantization research paper for LLM compression/acceleration
  Created 2 years ago · Updated 2 months ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Georgios Konstantopoulos (CTO, General Partner at Paradigm), and 5 more.

GPTQ-for-LLaMa by qwopqwop200

  0.0% · 3k stars
  4-bit quantization for LLaMA models using GPTQ
  Created 2 years ago · Updated 1 year ago