Code for GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers
This repository provides an efficient implementation of the GPTQ algorithm for post-training quantization of large generative language models. It enables significant model compression (to 2, 3, or 4 bits) with minimal accuracy loss, targeting researchers and practitioners working with large transformer models who need to reduce memory footprint and inference costs.
How It Works
GPTQ employs a layer-wise quantization approach that minimizes the reconstruction error introduced by quantizing weights. It achieves high accuracy by using approximate second-order information: the Hessian of the layer-wise reconstruction error determines how the error from each quantization step is compensated in the weights that have not yet been quantized. The implementation includes optimizations such as weight grouping and an optional act-order heuristic that quantizes columns with larger activation magnitudes first, further improving accuracy, especially for smaller models.
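To make the mechanism concrete, here is a minimal, unoptimized NumPy sketch of the core loop: quantize one column at a time and spread the resulting error over the not-yet-quantized columns via the Cholesky factor of the inverse Hessian. The function names, the simple symmetric round-to-nearest quantizer, and the damping constant are illustrative assumptions, not the repository's blocked, grouped, GPU-optimized implementation.

```python
import numpy as np

def quantize_rtn(x, bits=4):
    # Symmetric round-to-nearest quantization of a vector (illustrative only).
    maxq = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / maxq + 1e-12
    return np.clip(np.round(x / scale), -maxq, maxq) * scale

def gptq_quantize(W, X, bits=4, percdamp=0.01):
    """Quantize W (out_features x in_features) one column at a time.

    X holds calibration inputs of shape (in_features, n_samples).
    H = 2 * X @ X.T is the Hessian of the layer-wise reconstruction error;
    the Cholesky factor of its inverse is used to push each column's
    quantization error onto the columns that have not been quantized yet.
    """
    W = W.astype(np.float64).copy()
    d = W.shape[1]
    H = 2.0 * X @ X.T
    H += percdamp * np.mean(np.diag(H)) * np.eye(d)     # dampening for numerical stability
    Hinv = np.linalg.cholesky(np.linalg.inv(H)).T       # upper-triangular factor of H^-1

    Q = np.zeros_like(W)
    for j in range(d):
        q = quantize_rtn(W[:, j], bits)
        Q[:, j] = q
        err = (W[:, j] - q) / Hinv[j, j]
        W[:, j + 1:] -= np.outer(err, Hinv[j, j + 1:])  # compensate remaining columns
    return Q

# Tiny self-contained demo on random data.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))
X = rng.standard_normal((16, 128))
Q = gptq_quantize(W, X, bits=3)
print("layer reconstruction MSE:", np.mean((W @ X - Q @ X) ** 2))
```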
Quick Start & Requirements
Requires python setup_cuda.py install for the CUDA kernels; then use the provided Python scripts (e.g., opt.py, bloom.py, llama.py).
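A typical invocation looks like the line below; the model, calibration set, and flags are illustrative and may not match the current scripts exactly, so check each script's --help.
CUDA_VISIBLE_DEVICES=0 python opt.py facebook/opt-125m c4 --wbits 4 --groupsize 128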
Highlighted Details
act-order and true-sequential options for improved accuracy on models like LLaMa-7B (a rough act-order sketch follows).
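For intuition, the act-order idea can be layered on top of the hypothetical gptq_quantize sketch above: visit columns in order of decreasing Hessian diagonal (i.e., largest activation magnitude first) and undo the permutation afterwards. This is an illustrative assumption about the heuristic, not the repository's implementation of the flag.

```python
import numpy as np

def gptq_quantize_act_order(W, X, bits=4):
    # Reorder columns by decreasing diag(H) ~ activation magnitude, quantize
    # with the gptq_quantize sketch above, then undo the permutation.
    h_diag = np.sum(X * X, axis=1)        # proportional to diag(2 * X @ X.T)
    perm = np.argsort(-h_diag)
    inv_perm = np.argsort(perm)
    Q_perm = gptq_quantize(W[:, perm], X[perm, :], bits=bits)
    return Q_perm[:, inv_perm]
```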
Maintenance & Community
The project is associated with the ICLR 2023 paper "GPTQ: Accurate Post-training Compression for Generative Pretrained Transformers" by IST-DASLab. No specific community channels or active development updates are highlighted in the README.
Licensing & Compatibility
The repository does not explicitly state a license in the provided README. Users should verify licensing for commercial use or integration into closed-source projects.
Limitations & Caveats
The 3-bit CUDA kernels are specifically optimized for OPT-175B on A100 or A6000 GPUs and may yield suboptimal performance on other configurations. LLaMa integration requires installing from source.