gptq by IST-DASLab

Code for GPTQ: Accurate Post-training Quantization for Generative Pretrained Transformers

Created 3 years ago
2,289 stars

Top 19.4% on SourcePulse

Project Summary

This repository provides an efficient implementation of the GPTQ algorithm for post-training quantization of large generative language models. It enables significant model compression (to 2, 3, or 4 bits) with minimal accuracy loss, targeting researchers and practitioners working with large transformer models who need to reduce memory footprint and inference costs.

How It Works

GPTQ employs a layer-wise quantization approach that minimizes the error introduced by quantizing weights. It achieves high accuracy by using approximate second-order information (the Hessian of the layer-wise reconstruction loss) to quantize weight columns one at a time and to compensate for each column's error in the columns not yet quantized. The implementation includes optimizations such as weight grouping and an optional act-order heuristic that quantizes columns with larger activation magnitudes first, further improving accuracy, especially for smaller models.
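The per-column update above can be sketched in a few lines of NumPy. This is a simplified illustration of the idea, not the repository's optimized implementation; the function names, the round-to-nearest quantizer, and all shapes here are hypothetical:

```python
import numpy as np

def quantize_rtn(w, scale):
    # Round-to-nearest onto a symmetric 3-bit grid (8 levels).
    return np.clip(np.round(w / scale), -4, 3) * scale

def gptq_quantize(W, X, scale, damp=0.01):
    # Quantize the columns of W one at a time, left to right. After each
    # column, its quantization error is pushed into the not-yet-quantized
    # columns, weighted by the inverse Hessian of the layer-wise
    # reconstruction loss -- the core GPTQ-style update.
    W = W.astype(np.float64).copy()
    cols = W.shape[1]
    H = X @ X.T                                     # proxy Hessian from calibration inputs
    H += damp * np.mean(np.diag(H)) * np.eye(cols)  # dampening for numerical stability
    Hinv = np.linalg.cholesky(np.linalg.inv(H)).T   # upper Cholesky factor of H^-1
    Q = np.zeros_like(W)
    for j in range(cols):
        q = quantize_rtn(W[:, j], scale)
        Q[:, j] = q
        err = (W[:, j] - q) / Hinv[j, j]
        W[:, j + 1:] -= np.outer(err, Hinv[j, j + 1:])  # error compensation
    return Q

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))    # toy layer weights (out_features x in_features)
X = rng.normal(size=(16, 128))  # toy calibration inputs (in_features x samples)
Q = gptq_quantize(W, X, scale=0.5)
```

Round-to-nearest (RTN) quantizes each column in isolation; the compensation step is what lets GPTQ keep accuracy at 2-4 bits.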

Quick Start & Requirements

  • Install: run python setup_cuda.py install to build the CUDA kernels, then use the provided Python scripts (e.g., opt.py, bloom.py, llama.py).
  • Prerequisites: PyTorch (v1.10.1+cu111), Transformers (v4.21.2), Datasets (v1.17.0), CUDA (tested with 11.4), SentencePiece (for LLaMa). Requires NVIDIA GPU with CUDA support.
  • Resources: Experiments were run on an A100 80GB, but smaller GPUs are generally supported.
  • Docs: Official Paper
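A hedged usage sketch of the scripts listed above; the flag names (--wbits, --groupsize, --act-order, --true-sequential) should be verified against each script's --help, and the checkpoint path is a placeholder:

```shell
# Quantize OPT-125M to 4 bits with GPTQ, using C4 as calibration data:
CUDA_VISIBLE_DEVICES=0 python opt.py facebook/opt-125m c4 --wbits 4

# Grouped quantization plus the accuracy-oriented heuristics for LLaMa:
CUDA_VISIBLE_DEVICES=0 python llama.py <path-to-llama-hf-checkpoint> c4 \
    --wbits 4 --groupsize 128 --act-order --true-sequential
```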

Highlighted Details

  • Achieves state-of-the-art perplexity scores on benchmarks like Wiki2 for models like LLaMa.
  • Offers optimized 3-bit CUDA kernels with significant speedups (e.g., 1.9x to 3.25x on OPT-175B).
  • Introduces act-order and true-sequential options for improved accuracy on models like LLaMa-7B.
  • Supports compression for OPT and BLOOM model families, with LLaMa integration available.
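The memory win from low-bit storage comes from bit-packing. A toy sketch of one possible layout, ten 3-bit codes per 32-bit word; this is an illustration only, not the repository kernels' actual storage format:

```python
def pack3(codes):
    # Pack unsigned 3-bit codes (0..7), ten per 32-bit word; the top two
    # bits of each word stay unused. Hypothetical layout for illustration.
    assert len(codes) % 10 == 0 and all(0 <= c < 8 for c in codes)
    words = []
    for k in range(0, len(codes), 10):
        word = 0
        for i, c in enumerate(codes[k:k + 10]):
            word |= c << (3 * i)   # code i occupies bits 3i .. 3i+2
        words.append(word)
    return words

def unpack3(words, n):
    # Recover the first n codes by masking out each 3-bit field.
    codes = [(w >> (3 * i)) & 0b111 for w in words for i in range(10)]
    return codes[:n]

codes = [5, 0, 7, 3, 1, 6, 2, 4, 7, 0]
assert unpack3(pack3(codes), len(codes)) == codes
# Ten FP16 weights take 20 bytes; one packed word takes 4, i.e. 5x smaller.
```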

Maintenance & Community

The project is associated with the ICLR 2023 paper "GPTQ: Accurate Post-training Quantization for Generative Pretrained Transformers" by IST-DASLab. No specific community channels or active development updates are highlighted in the README.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. Users should verify licensing for commercial use or integration into closed-source projects.

Limitations & Caveats

The 3-bit CUDA kernels are specifically optimized for OPT-175B on A100 or A6000 GPUs and may yield suboptimal performance on other configurations. LLaMa integration requires installing from source.

Health Check

  • Last Commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 32 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Georgios Konstantopoulos (CTO, General Partner at Paradigm), and 5 more.

GPTQ-for-LLaMa by qwopqwop200

  • 3k stars
  • 4-bit quantization for LLaMA models using GPTQ
  • Created 3 years ago, updated 1 year ago