SpinQuant by facebookresearch

Code for the research paper on LLM quantization via learned rotations

created 1 year ago
306 stars

Top 88.6% on sourcepulse

View on GitHub
Project Summary

SpinQuant addresses the challenge of reducing the computational and memory footprint of Large Language Models (LLMs) through advanced quantization techniques. It is designed for researchers and engineers working on LLM deployment and optimization, offering a method to achieve significant compression with minimal accuracy loss.

How It Works

SpinQuant mitigates the impact of outliers in LLM weights and activations by applying rotation matrices before quantization, and it learns those rotations via Cayley-transform-based optimization rather than fixing them statically or choosing them at random. Learning the rotations improves quantization quality and narrows the accuracy gap to the full-precision model.
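
Concretely, one standard way to keep a learned rotation orthogonal during optimization is to parameterize it with the Cayley map of a skew-symmetric matrix; the display below sketches that parameterization and is not the paper's exact training objective:

    R = (I - A)(I + A)^{-1}, \qquad A = -A^{\top} \;\Longrightarrow\; R^{\top} R = I

Gradient updates act on the unconstrained skew-symmetric parameter A, so R remains a valid rotation throughout training and can be applied to weights and activations before quantization.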

Quick Start & Requirements

  • Installation: Clone the repository, install PyTorch with CUDA support, and then install the fast-hadamard-transform package.
    git clone https://github.com/facebookresearch/SpinQuant.git
    cd SpinQuant
    # Install PyTorch with CUDA from https://pytorch.org/get-started/locally/
    pip install -r requirements.txt
    # Install fast-hadamard-transform
    git clone https://github.com/Dao-AILab/fast-hadamard-transform.git
    cd fast-hadamard-transform
    pip install .
    
  • Prerequisites: Python 3.9, PyTorch >= 2.0 with CUDA support.
  • Usage: Scripts are provided for optimizing rotation matrices (10_optimize_rotation.sh, or 11_optimize_rotation_fsdp.sh for multi-GPU FSDP training) and for evaluating the quantized models (2_eval_ptq.sh); a hypothetical invocation is sketched after this list. Export to ExecuTorch is also supported (31_optimize_rotation_executorch.sh, 32_eval_ptq_executorch.sh).
  • Resources: Requires access to HuggingFace models (via access_token) and potentially large datasets for evaluation.
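
A typical end-to-end run might look like the sketch below. The positional arguments shown (a HuggingFace model name followed by weight, activation, and KV-cache bit-widths) are an assumption inferred from the W4A4KV4 setting; check each script for its exact interface and HuggingFace token handling.

    # Hypothetical invocation -- the argument layout is assumed, not verified against the scripts.
    # Step 1: learn the rotation matrices (use 11_optimize_rotation_fsdp.sh for multi-GPU FSDP training).
    bash 10_optimize_rotation.sh meta-llama/Llama-2-7b 4 4 4

    # Step 2: run post-training quantization with the learned rotations and evaluate the result.
    bash 2_eval_ptq.sh meta-llama/Llama-2-7b 4 4 4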

Highlighted Details

  • Achieves W4A4KV4 quantization of LLaMA-2 7B with only a 2.9-point accuracy gap to the full-precision model on zero-shot reasoning tasks.
  • Outperforms LLM-QAT by 19.1 points and SmoothQuant by 25.0 points in the same W4A4KV4 zero-shot reasoning setting.
  • Supports exporting quantized models to ExecuTorch for real-time speedups.
  • Provides pre-trained quantized models for Llama-3.2 and Llama-2 variants.

Maintenance & Community

The project is from Meta AI (facebookresearch) and is associated with the paper "SpinQuant: LLM Quantization with Learned Rotations." Contact information for Zechun Liu and Changsheng Zhao is provided.

Licensing & Compatibility

The project is licensed under CC-BY-NC 4.0, which restricts commercial use.

Limitations & Caveats

Beyond the non-commercial license restriction noted above, the README states that the results reported in the paper were produced with an internal Meta codebase; the released code is a reproduction built on HuggingFace, which may lead to minor discrepancies from the published numbers.

Health Check

  • Last commit: 5 months ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 4
  • Star History: 44 stars in the last 90 days

Explore Similar Projects

Starred by Lysandre Debut (Chief Open-Source Officer at Hugging Face), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 1 more.

AQLM by Vahe1994

0.1% · 1k stars
PyTorch code for LLM compression via Additive Quantization (AQLM)
created 1 year ago
updated 2 months ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Georgios Konstantopoulos (CTO, General Partner at Paradigm), and 2 more.

GPTQ-for-LLaMa by qwopqwop200

0.0% · 3k stars
4-bit quantization for LLaMA models using GPTQ
created 2 years ago
updated 1 year ago