gptq by IST-DASLab

Code for GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers

created 2 years ago
2,152 stars

Top 21.4% on sourcepulse

Project Summary

This repository provides an efficient implementation of the GPTQ algorithm for post-training quantization of large generative language models. It enables significant model compression (to 2, 3, or 4 bits) with minimal accuracy loss, targeting researchers and practitioners working with large transformer models who need to reduce memory footprint and inference costs.

How It Works

GPTQ quantizes a model one layer at a time, minimizing the error that quantization introduces in each layer's outputs on a small set of calibration inputs. It achieves high accuracy by using approximate second-order (Hessian) information about this layer-wise reconstruction error: weights are quantized one column at a time, and the remaining, not-yet-quantized weights are updated to compensate for the rounding error, as sketched below. The implementation adds optimizations such as weight grouping and an optional act-order heuristic that quantizes columns with larger activation magnitudes first, which further improves accuracy, especially for smaller models.
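A minimal NumPy sketch of that quantize-and-compensate loop follows. It is illustrative only: the names gptq_quantize_layer and rtn are invented here, and the repository's actual implementation is a batched PyTorch/CUDA version with per-row or per-group scales, optional act-order reordering, and lazy block updates.

    import numpy as np

    def rtn(x, scale):
        """Toy symmetric 4-bit round-to-nearest quantizer."""
        return np.clip(np.round(x / scale), -8, 7) * scale

    def gptq_quantize_layer(W, X, percdamp=0.01):
        """Illustrative GPTQ-style quantization of one linear layer.

        W: (out_features, in_features) weight matrix.
        X: (in_features, n_samples) calibration activations for this layer.
        """
        n_in = W.shape[1]
        H = 2.0 * X @ X.T                                   # Hessian of the layer-wise squared error
        H += percdamp * np.mean(np.diag(H)) * np.eye(n_in)  # dampening for numerical stability
        U = np.linalg.cholesky(np.linalg.inv(H)).T          # upper Cholesky factor of H^-1

        Q = W.copy()
        scale = np.abs(W).max() / 7  # single global scale; the repo uses per-row/per-group scales
        for j in range(n_in):
            w = Q[:, j]
            q = rtn(w, scale)                               # quantize column j
            err = (w - q) / U[j, j]
            Q[:, j] = q
            # spread the rounding error over the columns that are not quantized yet
            Q[:, j + 1:] -= np.outer(err, U[j, j + 1:])
        return Q

The act-order heuristic only changes the order of this loop: columns with larger Hessian diagonal entries (i.e., larger calibration activations) are processed first.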

Quick Start & Requirements

  • Install: run python setup_cuda.py install to build the CUDA kernels, then use the provided Python scripts (e.g., opt.py, bloom.py, llama.py); see the example commands after this list.
  • Prerequisites: PyTorch (v1.10.1+cu111), Transformers (v4.21.2), Datasets (v1.17.0), CUDA (tested with 11.4), SentencePiece (for LLaMa). Requires an NVIDIA GPU with CUDA support.
  • Resources: Experiments were run on an A100 80GB, but smaller GPUs are generally supported.
  • Docs: Official Paper
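As a usage sketch, quantizing an OPT model end to end might look like the commands below. The flag names (--wbits, --groupsize, --save) are taken from the upstream README, but treat the exact invocation as an assumption and verify it against each script's --help.

    # build the quantization CUDA kernels (run from the repository root)
    python setup_cuda.py install

    # quantize OPT-125M to 4 bits with group size 128 on the C4 calibration data,
    # report perplexity, and save the quantized checkpoint
    CUDA_VISIBLE_DEVICES=0 python opt.py facebook/opt-125m c4 --wbits 4 --groupsize 128 --save opt125m-4bit-128g.pt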

Highlighted Details

  • Achieves state-of-the-art perplexity scores on benchmarks such as WikiText-2 (Wiki2) for models like LLaMa.
  • Offers optimized 3-bit CUDA kernels with significant speedups (e.g., 1.9x to 3.25x on OPT-175B).
  • Introduces act-order and true-sequential options for improved accuracy on models like LLaMa-7B (see the example after this list).
  • Supports compression for OPT and BLOOM model families, with LLaMa integration available.
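As an example of the accuracy-oriented options above (again assuming the flag names from the upstream README; LLAMA_HF_FOLDER is a placeholder for a local Hugging Face-format LLaMa checkpoint):

    # 4-bit LLaMa quantization with the act-order and true-sequential heuristics enabled
    CUDA_VISIBLE_DEVICES=0 python llama.py LLAMA_HF_FOLDER c4 --wbits 4 --true-sequential --act-order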

Maintenance & Community

The project accompanies the ICLR 2023 paper "GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers" by IST-DASLab. No specific community channels or active development updates are highlighted in the README.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. Users should verify licensing for commercial use or integration into closed-source projects.

Limitations & Caveats

The 3-bit CUDA kernels are optimized specifically for OPT-175B running on A100 or A6000 GPUs and may perform suboptimally on other model and GPU configurations. LLaMa integration requires installing the Transformers library from source.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 1
  • Issues (30d): 1
  • Star History: 66 stars in the last 90 days

Explore Similar Projects

  • AQLM by Vahe1994 (1k stars): PyTorch code for LLM compression via Additive Quantization (AQLM). Created 1 year ago, updated 2 months ago.

  • llm-awq by mit-han-lab (3k stars): Weight quantization research paper for LLM compression/acceleration. Created 2 years ago, updated 2 weeks ago.

  • GPTQ-for-LLaMa by qwopqwop200 (3k stars): 4-bit quantization for LLaMA models using GPTQ. Created 2 years ago, updated 1 year ago.

  • AutoGPTQ by AutoGPTQ (5k stars): LLM quantization package using GPTQ algorithm. Created 2 years ago, updated 3 months ago.