GPTQ-for-LLaMa by qwopqwop200

4-bit quantization for LLaMA models using GPTQ

created 2 years ago
3,061 stars

Top 16.0% on sourcepulse

Project Summary

This repository provides a Python implementation for quantizing LLaMA models to 4-bit precision using the GPTQ algorithm. It targets researchers and practitioners looking to reduce the memory footprint and computational requirements of large language models for inference, particularly on consumer hardware.

How It Works

The project implements GPTQ, a state-of-the-art one-shot (post-training) weight quantization technique for generative pre-trained transformers. It quantizes model weights to 4 bits, significantly reducing memory usage while aiming to minimize the resulting loss in accuracy. The quantization group size is configurable; a group size of 128 is generally recommended as a good balance between compression and accuracy.
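
Below is a minimal sketch of the core idea, assuming a plain PyTorch setting; it is illustrative only and is not the repository's actual code, which additionally handles activation ordering, weight packing, and custom CUDA/Triton kernels. Weight columns are rounded one at a time, each column's rounding error is spread over the not-yet-quantized columns via the inverse Hessian of the layer inputs, and fresh scale/zero-point parameters are computed every `groupsize` columns.

    import torch

    def gptq_like_quantize(W, H, wbits=4, groupsize=128, percdamp=0.01):
        """Toy GPTQ-style quantization of a weight matrix W (out_features x in_features).

        H is a Hessian proxy for the layer inputs (e.g. 2 * X @ X.T accumulated over
        calibration batches). Returns de-quantized weights so the effect on the layer
        output can be compared against the original W.
        """
        W = W.clone().float()
        cols = W.shape[1]
        maxq = 2 ** wbits - 1

        # Dampen the Hessian, then take the upper Cholesky factor of its inverse.
        H = H.clone().float() + percdamp * torch.diag(H).mean() * torch.eye(cols)
        Hinv = torch.linalg.cholesky(
            torch.cholesky_inverse(torch.linalg.cholesky(H)), upper=True
        )

        Q = torch.zeros_like(W)
        scale = zero = None
        for i in range(cols):
            if i % groupsize == 0:
                # New group: recompute per-row asymmetric scale and zero-point.
                g = W[:, i:i + groupsize]
                wmin = g.min(dim=1, keepdim=True).values.clamp(max=0)
                wmax = g.max(dim=1, keepdim=True).values.clamp(min=0)
                scale = (wmax - wmin).clamp(min=1e-8) / maxq
                zero = torch.round(-wmin / scale)
            w = W[:, i:i + 1]
            q = torch.clamp(torch.round(w / scale) + zero, 0, maxq)
            Q[:, i:i + 1] = (q - zero) * scale
            # Spread this column's rounding error over the remaining columns.
            err = (w - Q[:, i:i + 1]) / Hinv[i, i]
            W[:, i + 1:] -= err * Hinv[i, i + 1:].unsqueeze(0)
        return Q

The group size controls how many input channels share one set of scale/zero-point values: smaller groups track the weight distribution more closely but store more metadata, which is why 128 is suggested as the middle ground.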

Quick Start & Requirements

  • Installation: Requires Python 3.9 and PyTorch (tested with 2.0.0+cu117); installing into a fresh Conda environment is recommended:
    conda create --name gptq python=3.9 -y
    conda activate gptq
    conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
    git clone https://github.com/qwopqwop200/GPTQ-for-LLaMA
    cd GPTQ-for-LLaMA
    pip install -r requirements.txt
    
  • Prerequisites: Linux (or WSL2 on Windows) is required because the fast kernels depend on Triton. An NVIDIA GPU with CUDA 11.7 is recommended.
  • Resources: The quantization step itself needs a large amount of CPU RAM, while inference with the resulting 4-bit model needs far less GPU VRAM than FP16 (a rough sizing sketch follows this list).
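
As a rough, back-of-the-envelope illustration of the VRAM reduction (weights only; per-group scales/zero-points, activations, and the KV cache add to real usage):

    # Approximate weight storage for the LLaMA sizes covered by the repo.
    def weight_gib(n_params, bits):
        return n_params * bits / 8 / 2 ** 30

    for n_billion in (7, 13, 33, 65):
        n = n_billion * 1e9
        print(f"{n_billion}B params: FP16 ~{weight_gib(n, 16):.1f} GiB"
              f" -> 4-bit ~{weight_gib(n, 4):.1f} GiB")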

Highlighted Details

  • Achieves 4-bit quantization for LLaMA models (7B, 13B, 33B, 65B).
  • Demonstrates WikiText-2 and C4 perplexity competitive with FP16 and with other quantization methods such as NF4.
  • Offers options for saving quantized models in .pt or .safetensors format (a generic safetensors round-trip is sketched after this list).
  • Includes scripts for converting LLaMA weights to Hugging Face format and performing inference.
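
For the .safetensors option, the snippet below shows only a generic round-trip with the safetensors library; the tensor names and shapes are made up for illustration, whereas the repository's own scripts save the packed low-bit weights and per-group quantization metadata that its custom layers expect.

    from safetensors.torch import load_file, save_file
    import torch

    # Hypothetical packed state dict; real files hold packed quantized weights
    # plus per-group scales/zero-points produced by the repo's quantization script.
    state = {
        "model.layers.0.self_attn.q_proj.qweight": torch.zeros(512, 128, dtype=torch.int32),
        "model.layers.0.self_attn.q_proj.scales": torch.ones(32, 512, dtype=torch.float16),
    }
    save_file(state, "llama-4bit-128g.safetensors")
    restored = load_file("llama-4bit-128g.safetensors")  # plain dict of tensors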

Maintenance & Community

The project is based on GPTQ and acknowledges contributions from Meta AI and the GPTQ-triton project. The primary developer recommends using AutoGPTQ for newer developments.

Licensing & Compatibility

The repository does not explicitly state a license in the README. However, it is based on other projects, and users should verify licensing for commercial use.

Limitations & Caveats

The project explicitly supports only Linux (or WSL2) because of its Triton kernel dependencies, and quantization requires significant CPU memory. Quantization-induced accuracy loss is most pronounced on the smaller models and shrinks as model size grows. Development has effectively moved on: the author recommends AutoGPTQ instead.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 20 stars in the last 90 days

Explore Similar Projects

Starred by Lysandre Debut (Chief Open-Source Officer at Hugging Face), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 1 more.

AQLM by Vahe1994 (0.1%, 1k stars)
PyTorch code for LLM compression via Additive Quantization (AQLM)
Created 1 year ago, updated 2 months ago.

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jeremy Howard (Cofounder of fast.ai), and 4 more.

llm-awq by mit-han-lab (0.4%, 3k stars)
Weight quantization research paper for LLM compression/acceleration
Created 2 years ago, updated 2 weeks ago.

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 6 more.

AutoGPTQ by AutoGPTQ (0.1%, 5k stars)
LLM quantization package using GPTQ algorithm
Created 2 years ago, updated 3 months ago.