Code for GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers
This repository provides an efficient implementation of the GPTQ algorithm for post-training quantization of large generative language models. It enables significant model compression (to 2, 3, or 4 bits) with minimal accuracy loss, targeting researchers and practitioners working with large transformer models who need to reduce memory footprint and inference costs.
How It Works
GPTQ employs a layer-wise quantization approach that minimizes the reconstruction error introduced by quantizing weights. It achieves high accuracy by using approximate second-order information: the Hessian of the layer-wise reconstruction error determines how the error from each quantization step is compensated in the weights that have not yet been quantized. The implementation includes optimizations such as weight grouping and an optional act-order heuristic that quantizes columns with larger activation magnitudes first, further improving accuracy, especially for smaller models.
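To make the mechanism concrete, here is a minimal, unoptimized NumPy sketch of the core loop: quantize one column at a time and spread the resulting error over the not-yet-quantized columns via the Cholesky factor of the inverse Hessian. The function names, the simple symmetric round-to-nearest quantizer, and the damping constant are illustrative assumptions, not the repository's blocked, grouped, GPU-optimized implementation.

```python
import numpy as np

def quantize_rtn(x, bits=4):
    # Symmetric round-to-nearest quantization of a vector (illustrative only).
    maxq = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / maxq + 1e-12
    return np.clip(np.round(x / scale), -maxq, maxq) * scale

def gptq_quantize(W, X, bits=4, percdamp=0.01):
    """Quantize W (out_features x in_features) one column at a time.

    X holds calibration inputs of shape (in_features, n_samples).
    H = 2 * X @ X.T is the Hessian of the layer-wise reconstruction error;
    the Cholesky factor of its inverse is used to push each column's
    quantization error onto the columns that have not been quantized yet.
    """
    W = W.astype(np.float64).copy()
    d = W.shape[1]
    H = 2.0 * X @ X.T
    H += percdamp * np.mean(np.diag(H)) * np.eye(d)     # dampening for numerical stability
    Hinv = np.linalg.cholesky(np.linalg.inv(H)).T       # upper-triangular factor of H^-1

    Q = np.zeros_like(W)
    for j in range(d):
        q = quantize_rtn(W[:, j], bits)
        Q[:, j] = q
        err = (W[:, j] - q) / Hinv[j, j]
        W[:, j + 1:] -= np.outer(err, Hinv[j, j + 1:])  # compensate remaining columns
    return Q

# Tiny self-contained demo on random data.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))
X = rng.standard_normal((16, 128))
Q = gptq_quantize(W, X, bits=3)
print("layer reconstruction MSE:", np.mean((W @ X - Q @ X) ** 2))
```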
Quick Start & Requirements
Requires python setup_cuda.py install for the CUDA kernels; then use the provided Python scripts (e.g., opt.py, bloom.py, llama.py).
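A typical invocation looks like the line below; the model, calibration set, and flags are illustrative and may not match the current scripts exactly, so check each script's --help.
CUDA_VISIBLE_DEVICES=0 python opt.py facebook/opt-125m c4 --wbits 4 --groupsize 128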
Highlighted Details
act-order and true-sequential options for improved accuracy on models like LLaMa-7B (a rough act-order sketch follows).
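For intuition, the act-order idea can be layered on top of the hypothetical gptq_quantize sketch above: visit columns in order of decreasing Hessian diagonal (i.e., largest activation magnitude first) and undo the permutation afterwards. This is an illustrative assumption about the heuristic, not the repository's implementation of the flag.

```python
import numpy as np

def gptq_quantize_act_order(W, X, bits=4):
    # Reorder columns by decreasing diag(H) ~ activation magnitude, quantize
    # with the gptq_quantize sketch above, then undo the permutation.
    h_diag = np.sum(X * X, axis=1)        # proportional to diag(2 * X @ X.T)
    perm = np.argsort(-h_diag)
    inv_perm = np.argsort(perm)
    Q_perm = gptq_quantize(W[:, perm], X[perm, :], bits=bits)
    return Q_perm[:, inv_perm]
```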
Maintenance & Community
The project is associated with the ICLR 2023 paper "GPTQ: Accurate Post-training Compression for Generative Pretrained Transformers" by IST-DASLab. No specific community channels or active development updates are highlighted in the README.
Licensing & Compatibility
The repository does not explicitly state a license in the provided README. Users should verify licensing for commercial use or integration into closed-source projects.
Limitations & Caveats
The 3-bit CUDA kernels are specifically optimized for OPT-175B on A100 or A6000 GPUs and may yield suboptimal performance on other configurations. LLaMa integration requires installing from source.