smoothquant by mit-han-lab

Post-training quantization research paper for large language models

Created 2 years ago
1,500 stars

Top 27.6% on SourcePulse

Project Summary

SmoothQuant addresses the challenge of running large language models (LLMs) efficiently by enabling accurate and fast 8-bit weight and 8-bit activation (W8A8) post-training quantization. It targets researchers and practitioners seeking to reduce LLM memory footprint and inference latency with minimal accuracy loss, offering a turn-key solution for hardware cost reduction and LLM democratization.

How It Works

SmoothQuant employs a novel approach to migrate quantization difficulty from activations to weights. By identifying and smoothing activation outliers offline, it makes both weights and activations amenable to 8-bit quantization. This mathematically equivalent transformation allows for INT8 quantization across all matrix multiplications in LLMs, unlike previous methods that struggled with activation outliers or lacked hardware efficiency.
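For intuition, the smoothing is a per-input-channel rescaling: with migration strength α (0.5 by default in the paper), each channel's scale is s_j = max|X_j|^α / max|W_j|^(1−α), and the layer computes (X diag(s)^−1)(diag(s) W) instead of XW. Below is a minimal PyTorch sketch of that idea, assuming the function name and clamp floor; the repository's actual smoothing code may differ in details.

    import torch

    # act_max: per-input-channel max |activation|, shape (in_features,),
    #          collected offline on calibration data.
    # weight:  linear layer weight, shape (out_features, in_features).
    def smooth_scales(act_max: torch.Tensor, weight: torch.Tensor, alpha: float = 0.5):
        w_max = weight.abs().amax(dim=0)  # per-input-channel weight max, shape (in_features,)
        # Migrate outlier magnitude from activations into weights.
        s = (act_max.pow(alpha) / w_max.pow(1 - alpha)).clamp(min=1e-5)
        # Scale each input-channel column of W by s_j; X is divided by s at
        # runtime, so (X / s) @ W'.T == X @ W.T and the output is unchanged.
        weight.mul_(s)
        return s

At deployment the division by s is folded into the preceding LayerNorm's parameters, so the smoothing itself adds no runtime overhead.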

Quick Start & Requirements

  • Install: pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113, then pip install transformers==4.36.0 accelerate datasets zstandard, and finally python setup.py install from the repository root.
  • Prerequisites: Python 3.8, PyTorch 1.12.1 with CUDA 11.3.
  • Usage: Load pre-quantized models from Hugging Face (e.g., mit-han-lab/opt-30b-smoothquant) using Int8OPTForCausalLM.from_pretrained(); see the sketch after this list. Scripts for smoothing, quantization, and evaluation are provided.
  • Links: Paper, Slides, Video, Hugging Face Models
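A loading sketch for the usage bullet above: the checkpoint name and Int8OPTForCausalLM come from the README, but the import path, tokenizer choice, device placement, and generation call are assumptions, not the repository's verbatim example.

    import torch
    from transformers import AutoTokenizer
    from smoothquant.opt import Int8OPTForCausalLM  # import path assumed

    # Load a pre-quantized W8A8 OPT checkpoint published by the authors.
    model = Int8OPTForCausalLM.from_pretrained(
        "mit-han-lab/opt-30b-smoothquant", torch_dtype=torch.float16, device_map="auto"
    )
    # Reuse the tokenizer of the FP16 base model.
    tokenizer = AutoTokenizer.from_pretrained("facebook/opt-30b")

    inputs = tokenizer("SmoothQuant enables INT8 inference", return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(output[0], skip_special_tokens=True))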

Highlighted Details

  • Enables W8A8 quantization for models like Llama, Mistral, Falcon, and OPT with negligible accuracy loss.
  • Achieves up to 1.56x speedup and 2x memory reduction compared to FP16.
  • Integrated into major serving frameworks: ONNX Runtime, Amazon SageMaker, NVIDIA TensorRT-LLM, and Intel Neural Compressor.
  • Demonstrates faster inference than LLM.int8() and enables serving larger models with fewer GPUs.

Maintenance & Community

The project comes from MIT HAN Lab. Its integration into major industry frameworks indicates broad adoption, though commit activity on the repository itself has slowed (last commit about a year ago; see Health Check below).

Licensing & Compatibility

The repository does not explicitly state a license in the README. Users should verify licensing for commercial or closed-source use.

Limitations & Caveats

The provided PyTorch implementation requires specific older versions of PyTorch and CUDA. For larger models or optimal performance, integration with FasterTransformer is recommended. The README does not detail specific hardware requirements beyond CUDA for the PyTorch path.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 29 stars in the last 30 days

Explore Similar Projects

Starred by Junyang Lin (Core Maintainer at Alibaba Qwen), Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems"), and 3 more.

neural-compressor by intel (0.2%, 2k stars) — Python library for model compression (quantization, pruning, distillation, NAS). Created 5 years ago; updated 14 hours ago.