smoothquant by mit-han-lab

Post-training quantization research paper for large language models

created 2 years ago
1,461 stars

Top 28.6% on sourcepulse

View on GitHub
Project Summary

SmoothQuant addresses the challenge of running large language models (LLMs) efficiently by enabling accurate and fast 8-bit weight and 8-bit activation (W8A8) post-training quantization. It targets researchers and practitioners seeking to reduce LLM memory footprint and inference latency with minimal accuracy loss, offering a turn-key solution for hardware cost reduction and LLM democratization.

How It Works

SmoothQuant employs a novel approach to migrate quantization difficulty from activations to weights. By identifying and smoothing activation outliers offline, it makes both weights and activations amenable to 8-bit quantization. This mathematically equivalent transformation allows for INT8 quantization across all matrix multiplications in LLMs, unlike previous methods that struggled with activation outliers or lacked hardware efficiency.
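The core trick is small enough to sketch. Below is a minimal illustration of the smoothing step, assuming a linear layer Y = X·W with W laid out as [in_features, out_features]; it follows the paper's per-channel scale formula (migration strength α, 0.5 by default) and is not the repo's actual implementation:

    # Minimal sketch of SmoothQuant's smoothing step (illustrative only).
    # Convention here: weight is [in_features, out_features].
    import torch

    def smooth(act_max: torch.Tensor, weight: torch.Tensor, alpha: float = 0.5):
        # act_max: per-input-channel max |activation|, collected offline on
        # a small calibration set.
        w_max = weight.abs().max(dim=1).values.clamp(min=1e-5)
        # Paper's scale: s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)
        s = (act_max.clamp(min=1e-5) ** alpha) / (w_max ** (1 - alpha))
        # (X / s) @ (diag(s) @ W) == X @ W, so the network's output is
        # unchanged, but activation outliers shrink by s and the weights
        # absorb the quantization difficulty.
        return 1.0 / s, weight * s.unsqueeze(1)

In practice the 1/s factor is folded into the preceding LayerNorm (or other prior op), so smoothing adds no overhead at inference time.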

Quick Start & Requirements

  • Install: pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113, then pip install transformers==4.36.0 accelerate datasets zstandard, and finally python setup.py install from the repo root.
  • Prerequisites: Python 3.8, PyTorch 1.12.1 with CUDA 11.3.
  • Usage: Load pre-quantized models from Hugging Face (e.g., mit-han-lab/opt-30b-smoothquant) using Int8OPTForCausalLM.from_pretrained(); see the loading sketch after this list. Scripts for smoothing, quantization, and evaluation are provided.
  • Links: Paper, Slides, Video, Hugging Face Models
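As a concrete starting point, here is a hedged loading sketch. Int8OPTForCausalLM and the mit-han-lab/opt-30b-smoothquant checkpoint come from the README; the smoothquant.opt import path and reuse of the FP16 model's tokenizer are assumptions:

    # Load a pre-quantized SmoothQuant OPT checkpoint and generate text.
    from transformers import AutoTokenizer
    from smoothquant.opt import Int8OPTForCausalLM  # assumed import path

    tokenizer = AutoTokenizer.from_pretrained("facebook/opt-30b")
    model = Int8OPTForCausalLM.from_pretrained("mit-han-lab/opt-30b-smoothquant")
    model = model.cuda()  # the INT8 kernels require a CUDA device

    inputs = tokenizer("Quantization lets us", return_tensors="pt").to("cuda")
    out = model.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(out[0], skip_special_tokens=True))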

Highlighted Details

  • Enables W8A8 quantization for models like Llama, Mistral, Falcon, and OPT with negligible accuracy loss.
  • Achieves up to 1.56x speedup and 2x memory reduction compared to FP16.
  • Integrated into major serving frameworks: ONNX Runtime, Amazon SageMaker, NVIDIA TensorRT-LLM, and Intel Neural Compressor.
  • Demonstrates faster inference than LLM.int8() and enables serving larger models with fewer GPUs.

Maintenance & Community

The project is maintained by MIT HAN Lab. Its integration into major industry frameworks suggests active development and adoption.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Users should verify licensing for commercial or closed-source use.

Limitations & Caveats

The PyTorch implementation is pinned to older dependencies (PyTorch 1.12.1, CUDA 11.3). For larger models or optimal performance, integration with FasterTransformer is recommended. The README does not detail hardware requirements beyond a CUDA-capable GPU for the PyTorch path.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 75 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jaret Burkett (Founder of Ostris), and 1 more:

nunchaku by nunchaku-tech
  High-performance 4-bit diffusion model inference engine
  3k stars · top 2.1% on sourcepulse · created 8 months ago · updated 17 hours ago

Starred by Lysandre Debut (Chief Open-Source Officer at Hugging Face), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 1 more:

AQLM by Vahe1994
  PyTorch code for LLM compression via Additive Quantization (AQLM)
  1k stars · top 0.1% on sourcepulse · created 1 year ago · updated 2 months ago

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jeremy Howard (Cofounder of fast.ai), and 4 more:

llm-awq by mit-han-lab
  Weight quantization research paper for LLM compression/acceleration
  3k stars · top 0.4% on sourcepulse · created 2 years ago · updated 2 weeks ago

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Georgios Konstantopoulos (CTO, General Partner at Paradigm), and 2 more:

GPTQ-for-LLaMa by qwopqwop200
  4-bit quantization for LLaMA models using GPTQ
  3k stars · top 0.0% on sourcepulse · created 2 years ago · updated 1 year ago

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 6 more:

AutoGPTQ by AutoGPTQ
  LLM quantization package using GPTQ algorithm
  5k stars · top 0.1% on sourcepulse · created 2 years ago · updated 3 months ago