4-bit quantization for LLaMA models using GPTQ
This repository provides a Python implementation for quantizing LLaMA models to 4-bit precision using the GPTQ algorithm. It targets researchers and practitioners looking to reduce the memory footprint and computational requirements of large language models for inference, particularly on consumer hardware.
How It Works
The project implements GPTQ (Generative Pre-trained Transformer Quantization), a state-of-the-art one-shot weight quantization method. It quantizes model weights to 4 bits, substantially reducing memory usage while aiming to minimize accuracy loss. The implementation supports configurable quantization group sizes; a group size of 128 is generally recommended as a good trade-off between compression and accuracy.
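To make the group-size idea concrete, the sketch below shows plain group-wise 4-bit round-to-nearest quantization in PyTorch. It is illustrative only: GPTQ itself additionally compensates quantization error column by column using second-order (Hessian) information, which is omitted here, and the function and tensor names are placeholders rather than the repository's API.

# Minimal sketch of group-wise 4-bit round-to-nearest quantization (illustrative only;
# GPTQ additionally applies Hessian-based error compensation, omitted here).
import torch

def quantize_groupwise_4bit(weight: torch.Tensor, group_size: int = 128):
    # Split each row into groups of `group_size` columns; each group gets its
    # own asymmetric scale and zero-point mapping onto the integer range [0, 15].
    out_features, in_features = weight.shape
    assert in_features % group_size == 0
    w = weight.reshape(out_features, in_features // group_size, group_size)
    w_min = w.amin(dim=-1, keepdim=True)
    w_max = w.amax(dim=-1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-8) / 15.0
    zero = torch.round(-w_min / scale)
    q = torch.clamp(torch.round(w / scale) + zero, 0, 15).to(torch.uint8)
    return q, scale, zero

def dequantize_4bit(q, scale, zero, shape):
    # Reconstruct an approximate floating-point weight matrix from the 4-bit codes.
    return ((q.float() - zero) * scale).reshape(shape)

w = torch.randn(256, 512)
q, s, z = quantize_groupwise_4bit(w, group_size=128)
print("reconstruction error:", (w - dequantize_4bit(q, s, z, w.shape)).abs().mean().item())

Smaller group sizes give each scale fewer weights to cover (better accuracy, more overhead per weight); larger groups compress further at some accuracy cost, which is why 128 is the usual middle ground.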
Quick Start & Requirements
conda create --name gptq python=3.9 -y
conda activate gptq
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMA
cd GPTQ-for-LLaMA
pip install -r requirements.txt
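Because the Triton kernels require a working GPU setup, it can help to confirm that the CUDA build of PyTorch is active before quantizing. The snippet below is a generic sanity check, not part of the repository:

# Generic sanity check, not part of the repository.
import torch
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))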
Highlighted Details
Quantized models can be saved and loaded in either .pt or .safetensors format, as sketched below.
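As a rough illustration of the two checkpoint formats (the key names are placeholders and the repository's own helpers may pack the 4-bit tensors differently):

# Rough illustration of the two checkpoint formats; key names are placeholders
# and the repository's own helpers may pack 4-bit weights differently.
import torch
from safetensors.torch import save_file, load_file

state_dict = {
    "layer.qweight": torch.randint(0, 16, (256, 128), dtype=torch.uint8),
    "layer.scales": torch.randn(256, 1),
}
torch.save(state_dict, "llama-4bit-128g.pt")          # .pt: pickled PyTorch checkpoint
save_file(state_dict, "llama-4bit-128g.safetensors")  # .safetensors: pickle-free, zero-copy format
restored = load_file("llama-4bit-128g.safetensors")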
Maintenance & Community
The project builds on the original GPTQ codebase and acknowledges contributions from Meta AI and the GPTQ-triton project. The primary developer recommends AutoGPTQ for new development.
Licensing & Compatibility
The repository does not explicitly state a license in the README. However, it is based on other projects, and users should verify licensing for commercial use.
Limitations & Caveats
The project explicitly states it only supports Linux (or WSL2) due to Triton kernel dependencies. Quantization requires significant CPU memory. Performance degradation increases with model size, and the project is superseded by AutoGPTQ.