SqueezeLLM by SqueezeAILab

Quantization framework for efficient LLM serving (ICML 2024 paper)

created 2 years ago
698 stars

Top 49.8% on sourcepulse

Project Summary

SqueezeLLM offers a post-training quantization framework for efficient Large Language Model (LLM) serving, targeting researchers and engineers seeking to reduce memory footprints without sacrificing performance. It introduces Dense-and-Sparse Quantization, a novel method that splits weights into a heavily quantized dense component and a sparse component preserving sensitive weights, enabling larger models to run on less memory with comparable or improved accuracy.

How It Works

SqueezeLLM employs a Dense-and-Sparse Quantization strategy. This approach partitions each LLM weight matrix into two parts: a dense component that is aggressively quantized (e.g., to 3 or 4 bits) using non-uniform, sensitivity-based codebooks, and a small sparse component that keeps outliers and other sensitive weights in full precision. This hybrid decomposition yields significant memory reduction while maintaining model fidelity, outperforming naive round-to-nearest quantization at the same bit-width.
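
Below is a minimal, self-contained sketch of the dense-and-sparse split, for intuition only. It substitutes magnitude-based outlier selection and plain (unweighted) 1-D k-means for the repository's Hessian/sensitivity-based selection and weighted clustering, so the function names and heuristics here are illustrative assumptions, not SqueezeLLM's implementation.

    # Illustrative sketch of a dense-and-sparse weight decomposition (NOT the
    # repository's implementation): outliers are picked by magnitude and the
    # dense remainder is quantized to 2**bits centroids with unweighted k-means;
    # SqueezeLLM instead uses Hessian-based sensitivity for both steps.
    import numpy as np

    def dense_and_sparse_quantize(W, bits=3, sparse_frac=0.0045, iters=20):
        flat = W.ravel()
        k = 2 ** bits

        # Sparse component: keep the largest-magnitude weights exactly (FP16).
        n_sparse = max(1, int(sparse_frac * flat.size))
        sparse_idx = np.argpartition(np.abs(flat), -n_sparse)[-n_sparse:]
        sparse_vals = flat[sparse_idx].astype(np.float16)

        # Dense component: remaining weights mapped to k non-uniform centroids.
        dense = flat.copy()
        dense[sparse_idx] = 0.0  # outliers live only in the sparse part
        centroids = np.quantile(dense, np.linspace(0.0, 1.0, k))  # data-driven init
        for _ in range(iters):  # plain 1-D k-means
            assign = np.abs(dense[:, None] - centroids[None, :]).argmin(axis=1)
            for c in range(k):
                members = dense[assign == c]
                if members.size:
                    centroids[c] = members.mean()
        codes = np.abs(dense[:, None] - centroids[None, :]).argmin(axis=1).astype(np.uint8)
        return codes.reshape(W.shape), centroids, sparse_idx, sparse_vals

    def reconstruct(codes, centroids, sparse_idx, sparse_vals):
        W_hat = centroids[codes].ravel()
        W_hat[sparse_idx] = sparse_vals.astype(W_hat.dtype)  # restore exact outliers
        return W_hat.reshape(codes.shape)

    # Toy usage: heavy-tailed fake weights with a handful of large outliers.
    rng = np.random.default_rng(0)
    W = (rng.standard_t(df=3, size=(256, 256)) * 0.02).astype(np.float32)
    codes, centroids, idx, vals = dense_and_sparse_quantize(W, bits=3)
    print("mean abs error:", np.abs(reconstruct(codes, centroids, idx, vals) - W).mean())

Keeping only a fraction of a percent of the weights in the sparse component is what lets the dense part use very low bit-widths without the outliers dominating the quantization error.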

Quick Start & Requirements

  • Install (a build sanity check follows this list):
    conda create --name sqllm python=3.9 -y
    conda activate sqllm
    git clone https://github.com/SqueezeAILab/SqueezeLLM
    cd SqueezeLLM
    pip install -e .
    cd squeezellm
    python setup_cuda.py install
    
  • Prerequisites: Python 3.9, CUDA 11.3, cuDNN 8.2. Tested on A5000/A6000 GPUs.
  • Resources: Requires downloading pre-quantized model checkpoints (e.g., .pt files). Original models may be needed for LLaMA (v1) and Vicuna v1.1.
  • Docs: Paper, Custom Model Quantization
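
If the setup_cuda.py step above succeeds, the compiled kernel extension should be importable. The snippet below is a minimal sanity check; the module name quant_cuda is an assumption carried over from the GPTQ-For-LLaMA code this repository reuses, so adjust it if the build registers the kernels under a different name.

    # Post-install sanity check. `quant_cuda` is an assumed extension name
    # (inherited from GPTQ-For-LLaMA); adjust if setup_cuda.py uses another.
    import torch

    try:
        import quant_cuda  # noqa: F401  (built by `python setup_cuda.py install`)
        print("quant_cuda kernels found; CUDA available:", torch.cuda.is_available())
    except ImportError as err:
        print("CUDA kernels not importable; re-run setup_cuda.py in squeezellm/:", err)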

Highlighted Details

  • Supports 3-bit and 4-bit quantization with sparsity levels of 0%, 0.05%, and 0.45% (a rough bits-per-weight estimate follows this list).
  • Achieves 2% higher MMLU on Vicuna models with a 2x smaller memory footprint than FP16.
  • Integrated into the official vLLM framework.
  • Supports LLaMA (v1, v2), Vicuna (v1.1, v1.3), Mistral, XGen (8k seq length), and OPT models.
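
To put the sparsity levels in perspective, the sketch below estimates the effective bits per weight when a b-bit dense component is paired with a small FP16 sparse component. The storage model (a 16-bit value plus a 32-bit index per kept weight) is an assumption for illustration, not the repository's actual packing format.

    # Rough effective bits/weight for a b-bit dense component plus sparse outliers.
    # Assumed cost per kept weight: 16-bit FP16 value + 32-bit index (illustrative
    # only; SqueezeLLM's real packing may differ).
    def effective_bits(dense_bits, sparse_frac, value_bits=16, index_bits=32):
        return dense_bits + sparse_frac * (value_bits + index_bits)

    for b in (3, 4):
        for s in (0.0, 0.0005, 0.0045):  # the 0%, 0.05%, and 0.45% sparsity levels
            print(f"{b}-bit dense, {s:.2%} sparse -> {effective_bits(b, s):.2f} bits/weight")

Even at the 0.45% level, the sparse component adds only about 0.2 bits per weight under this model, so the overall footprint stays close to the dense bit-width.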

Maintenance & Community

  • Recent updates added Mistral support and custom model quantization, though commit activity has since slowed (see Health Check below).
  • Code reuses components from GPTQ and GPTQ-For-LLaMA.
  • Citation: Kim et al., "SqueezeLLM: Dense-and-Sparse Quantization" (ICML 2024).

Licensing & Compatibility

  • The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

  • Quantization for Vicuna-30B-v1.3 is listed as "Coming Soon."
  • Reproducing paper results requires specific flags (--torch_profile, --eval) and, for older model versions (e.g., LLaMA v1, Vicuna v1.1), the original model weights.

Health Check

  • Last commit: 11 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 11 stars in the last 90 days

Explore Similar Projects

Starred by Lysandre Debut (Chief Open-Source Officer at Hugging Face), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 1 more.

AQLM by Vahe1994

PyTorch code for LLM compression via Additive Quantization (AQLM)
1k stars · created 1 year ago · updated 2 months ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Georgios Konstantopoulos (CTO, General Partner at Paradigm), and 2 more.

GPTQ-for-LLaMa by qwopqwop200

4-bit quantization for LLaMA models using GPTQ
3k stars · created 2 years ago · updated 1 year ago