smoothquant by mit-han-lab

Post-training quantization research paper for large language models

Created 2 years ago
1,500 stars

Top 27.6% on SourcePulse

Project Summary

SmoothQuant addresses the challenge of running large language models (LLMs) efficiently by enabling accurate and fast 8-bit weight and 8-bit activation (W8A8) post-training quantization. It targets researchers and practitioners seeking to reduce LLM memory footprint and inference latency with minimal accuracy loss, offering a turn-key solution for hardware cost reduction and LLM democratization.

How It Works

SmoothQuant employs a novel approach to migrate quantization difficulty from activations to weights. By identifying and smoothing activation outliers offline, it makes both weights and activations amenable to 8-bit quantization. This mathematically equivalent transformation allows for INT8 quantization across all matrix multiplications in LLMs, unlike previous methods that struggled with activation outliers or lacked hardware efficiency.
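For intuition, the smoothing is a per-input-channel rescaling: with migration strength α (0.5 by default in the paper), each channel's scale is s_j = max|X_j|^α / max|W_j|^(1−α), and the layer computes (X diag(s)^−1)(diag(s) W) instead of XW. Below is a minimal PyTorch sketch of that idea, assuming the function name and clamp floor; the repository's actual smoothing code may differ in details.

    import torch

    # act_max: per-input-channel max |activation|, shape (in_features,),
    #          collected offline on calibration data.
    # weight:  linear layer weight, shape (out_features, in_features).
    def smooth_scales(act_max: torch.Tensor, weight: torch.Tensor, alpha: float = 0.5):
        w_max = weight.abs().amax(dim=0)  # per-input-channel weight max, shape (in_features,)
        # Migrate outlier magnitude from activations into weights.
        s = (act_max.pow(alpha) / w_max.pow(1 - alpha)).clamp(min=1e-5)
        # Scale each input-channel column of W by s_j; X is divided by s at
        # runtime, so (X / s) @ W'.T == X @ W.T and the output is unchanged.
        weight.mul_(s)
        return s

At deployment the division by s is folded into the preceding LayerNorm's parameters, so the smoothing itself adds no runtime overhead.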

Quick Start & Requirements

  • Install: pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113, then pip install transformers==4.36.0 accelerate datasets zstandard, and finally python setup.py install from the repository root.
  • Prerequisites: Python 3.8, PyTorch 1.12.1 with CUDA 11.3.
  • Usage: Load pre-quantized models from Hugging Face (e.g., mit-han-lab/opt-30b-smoothquant) using Int8OPTForCausalLM.from_pretrained(); see the sketch after this list. Scripts for smoothing, quantization, and evaluation are provided.
  • Links: Paper, Slides, Video, Hugging Face Models
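A loading sketch for the usage bullet above: the checkpoint name and Int8OPTForCausalLM come from the README, but the import path, tokenizer choice, device placement, and generation call are assumptions, not the repository's verbatim example.

    import torch
    from transformers import AutoTokenizer
    from smoothquant.opt import Int8OPTForCausalLM  # import path assumed

    # Load a pre-quantized W8A8 OPT checkpoint published by the authors.
    model = Int8OPTForCausalLM.from_pretrained(
        "mit-han-lab/opt-30b-smoothquant", torch_dtype=torch.float16, device_map="auto"
    )
    # Reuse the tokenizer of the FP16 base model.
    tokenizer = AutoTokenizer.from_pretrained("facebook/opt-30b")

    inputs = tokenizer("SmoothQuant enables INT8 inference", return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(output[0], skip_special_tokens=True))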

Highlighted Details

  • Enables W8A8 quantization for models like Llama, Mistral, Falcon, and OPT with negligible accuracy loss.
  • Achieves up to 1.56x speedup and 2x memory reduction compared to FP16.
  • Integrated into major serving frameworks: ONNX Runtime, Amazon SageMaker, NVIDIA TensorRT-LLM, and Intel Neural Compressor.
  • Demonstrates faster inference than LLM.int8() and enables serving larger models with fewer GPUs.

Maintenance & Community

The project comes from MIT HAN Lab. Its integration into major industry frameworks indicates broad adoption, though commit activity on the repository itself has slowed (last commit about a year ago; see Health Check below).

Licensing & Compatibility

The repository does not explicitly state a license in the README. Users should verify licensing for commercial or closed-source use.

Limitations & Caveats

The provided PyTorch implementation requires specific older versions of PyTorch and CUDA. For larger models or optimal performance, integration with FasterTransformer is recommended. The README does not detail specific hardware requirements beyond CUDA for the PyTorch path.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 29 stars in the last 30 days

Explore Similar Projects

Starred by Junyang Lin (Core Maintainer at Alibaba Qwen), Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems"), and 3 more.

neural-compressor by intel (0.2%, 2k stars) — Python library for model compression (quantization, pruning, distillation, NAS). Created 5 years ago; updated 14 hours ago.