Post-training quantization research paper for large language models
SmoothQuant addresses the challenge of running large language models (LLMs) efficiently by enabling accurate and fast 8-bit weight and 8-bit activation (W8A8) post-training quantization. It targets researchers and practitioners seeking to reduce LLM memory footprint and inference latency with minimal accuracy loss, offering a turn-key solution for hardware cost reduction and LLM democratization.
How It Works
SmoothQuant employs a novel approach to migrate quantization difficulty from activations to weights. By identifying and smoothing activation outliers offline, it makes both weights and activations amenable to 8-bit quantization. This mathematically equivalent transformation allows for INT8 quantization across all matrix multiplications in LLMs, unlike previous methods that struggled with activation outliers or lacked hardware efficiency.
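A minimal numerical sketch of that idea follows (the smooth() helper, the tensor shapes, and alpha=0.5 are illustrative assumptions, not the project's API): each input channel's activation range is partially divided out and folded into the corresponding weight rows, so the matrix product is unchanged while activation outliers shrink.

import torch

def smooth(x, w, alpha=0.5):
    # x: activations (tokens, in_features); w: weights (in_features, out_features)
    act_max = x.abs().amax(dim=0)                    # per-input-channel activation range
    w_max = w.abs().amax(dim=1)                      # per-input-channel weight range
    s = act_max.pow(alpha) / w_max.pow(1 - alpha)    # alpha controls how much difficulty migrates to weights
    return x / s, w * s.unsqueeze(1)                 # (x / s) @ (s * w) == x @ w

x = torch.randn(16, 64) * (10 * torch.rand(64))      # synthetic activations with outlier channels
w = torch.randn(64, 32)
x_s, w_s = smooth(x, w)
assert torch.allclose(x @ w, x_s @ w_s, atol=1e-3)   # the transformation leaves the output unchanged

After smoothing, the activation outliers are attenuated while the weights absorb only a moderate share of the range, which is what makes straightforward 8-bit quantization of both operands viable.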
Quick Start & Requirements
Install the pinned PyTorch build, then the remaining Python dependencies, and finally build the package from source:
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113
pip install transformers==4.36.0 accelerate datasets zstandard
python setup.py install
Quantized OPT checkpoints (e.g., mit-han-lab/opt-30b-smoothquant) can then be loaded with Int8OPTForCausalLM.from_pretrained(), as in the sketch below.
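A minimal loading sketch, with the caveat that the smoothquant.opt import path, the device_map argument, and the generate() call are assumptions not confirmed by this summary; only the class name and checkpoint ID come from the text above.

from transformers import AutoTokenizer
from smoothquant.opt import Int8OPTForCausalLM   # assumed module path inside the installed package

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-30b")
model = Int8OPTForCausalLM.from_pretrained("mit-han-lab/opt-30b-smoothquant",
                                           device_map="auto")       # assumed; INT8 kernels require a CUDA GPU
inputs = tokenizer("SmoothQuant enables W8A8 inference", return_tensors="pt").to("cuda")
output_ids = model.generate(**inputs, max_new_tokens=32)             # assumes the usual causal-LM interface
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))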
Scripts for smoothing, quantization, and evaluation are provided.
Highlighted Details
Maintenance & Community
The project is associated with MIT HAN Lab (the mit-han-lab GitHub organization). Integrations into major industry inference frameworks suggest active development and adoption.
Licensing & Compatibility
The repository does not explicitly state a license in the README. Users should verify licensing for commercial or closed-source use.
Limitations & Caveats
The provided PyTorch implementation pins older framework versions (PyTorch 1.12.1 with CUDA 11.3). For larger models or optimal performance, integration with FasterTransformer is recommended. The README does not detail specific hardware requirements beyond a CUDA-capable GPU for the PyTorch path.