SpinQuant by facebookresearch

Code for the research paper on LLM quantization via learned rotations

created 1 year ago
306 stars

Top 88.6% on sourcepulse

View on GitHub
Project Summary

SpinQuant addresses the challenge of reducing the computational and memory footprint of Large Language Models (LLMs) through advanced quantization techniques. It is designed for researchers and engineers working on LLM deployment and optimization, offering a method to achieve significant compression with minimal accuracy loss.

How It Works

SpinQuant mitigates the impact of outliers in LLM weights and activations by applying rotation matrices before quantization, and it learns those rotations via Cayley-transform-based optimization rather than fixing them statically or choosing them at random. Learning the rotations improves quantization quality and narrows the accuracy gap to the full-precision model.
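
Concretely, one standard way to keep a learned rotation orthogonal during optimization is to parameterize it with the Cayley map of a skew-symmetric matrix; the display below sketches that parameterization and is not the paper's exact training objective:

    R = (I - A)(I + A)^{-1}, \qquad A = -A^{\top} \;\Longrightarrow\; R^{\top} R = I

Gradient updates act on the unconstrained skew-symmetric parameter A, so R remains a valid rotation throughout training and can be applied to weights and activations before quantization.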

Quick Start & Requirements

  • Installation: Clone the repository, install PyTorch with CUDA support, and then install the fast-hadamard-transform package.
    git clone https://github.com/facebookresearch/SpinQuant.git
    cd SpinQuant
    # Install PyTorch with CUDA from https://pytorch.org/get-started/locally/
    pip install -r requirements.txt
    # Install fast-hadamard-transform
    git clone https://github.com/Dao-AILab/fast-hadamard-transform.git
    cd fast-hadamard-transform
    pip install .
    
  • Prerequisites: Python 3.9, PyTorch >= 2.0 with CUDA support.
  • Usage: Scripts are provided for optimizing rotation matrices (10_optimize_rotation.sh, or 11_optimize_rotation_fsdp.sh for multi-GPU FSDP training) and for evaluating the quantized models (2_eval_ptq.sh); a hypothetical invocation is sketched after this list. Export to ExecuTorch is also supported (31_optimize_rotation_executorch.sh, 32_eval_ptq_executorch.sh).
  • Resources: Requires access to HuggingFace models (via access_token) and potentially large datasets for evaluation.
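
A typical end-to-end run might look like the sketch below. The positional arguments shown (a HuggingFace model name followed by weight, activation, and KV-cache bit-widths) are an assumption inferred from the W4A4KV4 setting; check each script for its exact interface and HuggingFace token handling.

    # Hypothetical invocation -- the argument layout is assumed, not verified against the scripts.
    # Step 1: learn the rotation matrices (use 11_optimize_rotation_fsdp.sh for multi-GPU FSDP training).
    bash 10_optimize_rotation.sh meta-llama/Llama-2-7b 4 4 4

    # Step 2: run post-training quantization with the learned rotations and evaluate the result.
    bash 2_eval_ptq.sh meta-llama/Llama-2-7b 4 4 4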

Highlighted Details

  • Achieves W4A4KV4 quantization of LLaMA-2 7B with only a 2.9-point accuracy gap to the full-precision model on zero-shot reasoning tasks.
  • Outperforms LLM-QAT by 19.1 points and SmoothQuant by 25.0 points in the same W4A4KV4 zero-shot reasoning setting.
  • Supports exporting quantized models to ExecuTorch for real-time speedups.
  • Provides pre-trained quantized models for Llama-3.2 and Llama-2 variants.

Maintenance & Community

The project is from Meta AI (facebookresearch) and is associated with the paper "SpinQuant: LLM Quantization with Learned Rotations." Contact information for Zechun Liu and Changsheng Zhao is provided.

Licensing & Compatibility

The project is licensed under CC-BY-NC 4.0, which restricts commercial use.

Limitations & Caveats

Beyond the non-commercial license restriction noted above, the README states that the results reported in the paper were produced with an internal Meta codebase; the released code is a reproduction built on HuggingFace, which may lead to minor discrepancies from the published numbers.

Health Check

  • Last commit: 5 months ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 4
  • Star History: 44 stars in the last 90 days

Explore Similar Projects

Starred by Lysandre Debut (Chief Open-Source Officer at Hugging Face), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 1 more.

AQLM by Vahe1994

0.1% · 1k stars
PyTorch code for LLM compression via Additive Quantization (AQLM)
created 1 year ago
updated 2 months ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Georgios Konstantopoulos (CTO, General Partner at Paradigm), and 2 more.

GPTQ-for-LLaMa by qwopqwop200

0.0% · 3k stars
4-bit quantization for LLaMA models using GPTQ
created 2 years ago
updated 1 year ago