4-bit quantization for LLaMA models using GPTQ
This repository provides a Python implementation for quantizing LLaMA models to 4-bit precision using the GPTQ algorithm. It targets researchers and practitioners looking to reduce the memory footprint and computational requirements of large language models for inference, particularly on consumer hardware.
How It Works
The project implements GPTQ (Generative Pre-trained Transformer Quantization), a state-of-the-art one-shot weight quantization method. It quantizes model weights to 4 bits, substantially reducing memory usage while aiming to minimize accuracy loss. The implementation supports configurable quantization group sizes; a group size of 128 is generally recommended as a good trade-off between compression and accuracy.
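To make the group-size idea concrete, the sketch below shows plain group-wise 4-bit round-to-nearest quantization in PyTorch. It is illustrative only: GPTQ itself additionally compensates quantization error column by column using second-order (Hessian) information, which is omitted here, and the function and tensor names are placeholders rather than the repository's API.

# Minimal sketch of group-wise 4-bit round-to-nearest quantization (illustrative only;
# GPTQ additionally applies Hessian-based error compensation, omitted here).
import torch

def quantize_groupwise_4bit(weight: torch.Tensor, group_size: int = 128):
    # Split each row into groups of `group_size` columns; each group gets its
    # own asymmetric scale and zero-point mapping onto the integer range [0, 15].
    out_features, in_features = weight.shape
    assert in_features % group_size == 0
    w = weight.reshape(out_features, in_features // group_size, group_size)
    w_min = w.amin(dim=-1, keepdim=True)
    w_max = w.amax(dim=-1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-8) / 15.0
    zero = torch.round(-w_min / scale)
    q = torch.clamp(torch.round(w / scale) + zero, 0, 15).to(torch.uint8)
    return q, scale, zero

def dequantize_4bit(q, scale, zero, shape):
    # Reconstruct an approximate floating-point weight matrix from the 4-bit codes.
    return ((q.float() - zero) * scale).reshape(shape)

w = torch.randn(256, 512)
q, s, z = quantize_groupwise_4bit(w, group_size=128)
print("reconstruction error:", (w - dequantize_4bit(q, s, z, w.shape)).abs().mean().item())

Smaller group sizes give each scale fewer weights to cover (better accuracy, more overhead per weight); larger groups compress further at some accuracy cost, which is why 128 is the usual middle ground.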
Quick Start & Requirements
conda create --name gptq python=3.9 -y
conda activate gptq
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMA
cd GPTQ-for-LLaMA
pip install -r requirements.txt
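Because the Triton kernels require a working GPU setup, it can help to confirm that the CUDA build of PyTorch is active before quantizing. The snippet below is a generic sanity check, not part of the repository:

# Generic sanity check, not part of the repository.
import torch
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))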
Highlighted Details
Quantized models can be saved and loaded in either .pt or .safetensors format, as sketched below.
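As a rough illustration of the two checkpoint formats (the key names are placeholders and the repository's own helpers may pack the 4-bit tensors differently):

# Rough illustration of the two checkpoint formats; key names are placeholders
# and the repository's own helpers may pack 4-bit weights differently.
import torch
from safetensors.torch import save_file, load_file

state_dict = {
    "layer.qweight": torch.randint(0, 16, (256, 128), dtype=torch.uint8),
    "layer.scales": torch.randn(256, 1),
}
torch.save(state_dict, "llama-4bit-128g.pt")          # .pt: pickled PyTorch checkpoint
save_file(state_dict, "llama-4bit-128g.safetensors")  # .safetensors: pickle-free, zero-copy format
restored = load_file("llama-4bit-128g.safetensors")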
Maintenance & Community
The project builds on the original GPTQ codebase and acknowledges contributions from Meta AI and the GPTQ-triton project. The primary developer recommends AutoGPTQ for new development.
Licensing & Compatibility
The repository does not explicitly state a license in the README. However, it is based on other projects, and users should verify licensing for commercial use.
Limitations & Caveats
The project explicitly states it only supports Linux (or WSL2) due to Triton kernel dependencies. Quantization requires significant CPU memory. Performance degradation increases with model size, and the project is superseded by AutoGPTQ.