AutoGPTQ by AutoGPTQ

LLM quantization package using GPTQ algorithm

Created 2 years ago
4,946 stars

Top 10.1% on SourcePulse

Project Summary

This package provides an easy-to-use library for quantizing Large Language Models (LLMs) using the GPTQ algorithm, targeting researchers and developers who need to reduce model size and improve inference speed. It offers user-friendly APIs and integrates with popular frameworks like Hugging Face Transformers.
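The core workflow takes only a few calls. A minimal sketch, closely following the README's quick-start example (the model id, sample text, and output path are illustrative):

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "facebook/opt-125m"  # illustrative; any supported architecture works
quantized_model_dir = "opt-125m-4bit"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
# Calibration data: a list of tokenized examples ("input_ids" and "attention_mask")
examples = [tokenizer("auto-gptq is an easy-to-use model quantization library based on the GPTQ algorithm.")]

quantize_config = BaseQuantizeConfig(
    bits=4,          # quantize weights to 4-bit integers
    group_size=128,  # recommended default group size
    desc_act=False,  # False speeds up inference at a small perplexity cost
)

# Load the full-precision model (into CPU memory by default), quantize, and save.
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)
model.save_quantized(quantized_model_dir)
```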

How It Works

AutoGPTQ implements weight-only quantization, reducing model weights to 4-bit integers while largely preserving accuracy. It leverages optimized kernels, including Marlin (for Ampere and newer GPUs) and ExLlamaV2, to accelerate inference. Quantization parameters such as bit width, group size, and activation ordering (desc_act) are configurable.
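Kernel choice is made when loading a quantized checkpoint. A minimal sketch, assuming a recent AutoGPTQ release (0.7+) that exposes the `use_marlin` flag on `from_quantized` (the checkpoint path is illustrative):

```python
from auto_gptq import AutoGPTQForCausalLM

# Load a saved 4-bit checkpoint onto the first GPU.
# use_marlin=True requests the Marlin int4*fp16 kernel (compute capability 8.0+);
# without it, recent releases typically select the ExLlamaV2 kernel where applicable.
model = AutoGPTQForCausalLM.from_quantized(
    "opt-125m-4bit",  # path produced by the quantization step above
    device="cuda:0",
    use_marlin=True,
)
```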

Quick Start & Requirements

  • Installation: pip install auto-gptq (with --extra-index-url for specific CUDA/ROCm builds). Install with Triton support via pip install auto-gptq[triton].
  • Prerequisites: Linux or Windows, CUDA 11.8/12.1 or ROCm 5.7. NVIDIA GPUs of the Maxwell generation or older are not supported.
  • Resources: Quantization and inference require significant GPU memory, scaling with model size; see the inference sketch after this list.
  • Docs: https://github.com/AutoGPTQ/AutoGPTQ
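Once installed, inference against a quantized checkpoint looks like the following. A minimal sketch mirroring the README's quick-start (paths and prompts are illustrative):

```python
from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM

quantized_model_dir = "opt-125m-4bit"  # illustrative local checkpoint

# The tokenizer comes from the original (unquantized) model repo.
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m", use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0")

# Either call model.generate directly...
inputs = tokenizer("auto_gptq is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs)[0]))

# ...or wrap the model in a standard Transformers pipeline.
pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer)
print(pipeline("auto-gptq is")[0]["generated_text"])
```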

Highlighted Details

  • Supports the Marlin int4*fp16 kernel for faster inference on compatible GPUs (compute capability 8.0+).
  • Integrated with 🤗 Transformers, Optimum, and PEFT for broader usability (see the sketch after this list).
  • Offers built-in evaluation tasks for assessing quantized model performance on downstream workloads.
  • Supports quantization and inference for a wide range of model architectures, including Llama, GPT-J, OPT, and Falcon.
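Because of the 🤗 Transformers integration, GPTQ checkpoints can also be loaded without touching AutoGPTQ's own API (this path requires the optimum and auto-gptq packages). A sketch; the Hub repo id below is illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "TheBloke/Llama-2-7B-GPTQ"  # illustrative: any Hub repo holding a GPTQ checkpoint

# Transformers reads the GPTQ quantization config stored in the checkpoint and
# dispatches to AutoGPTQ's kernels under the hood.
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto")

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```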

Maintenance & Community

  • Status: The project is marked as unmaintained; the README recommends GPTQModel as its successor for bug fixes and new model support.
  • Community: The README provides no Discord or Slack links.

Licensing & Compatibility

  • License: Apache 2.0.
  • Compatibility: Compatible with commercial use and closed-source linking.

Limitations & Caveats

The project is unmaintained, so active development, new model support, and bug fixes have effectively stopped; the README points users to GPTQModel instead. It also warns that quantizing with too few calibration samples can reduce model quality.

Health Check

  • Last Commit: 5 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 30 stars in the last 30 days

Explore Similar Projects

Starred by Tobi Lutke (Cofounder of Shopify), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 11 more.

ctransformers by marella

Top 0.1% · 2k stars
Python bindings for fast Transformer model inference
Created 2 years ago · Updated 1 year ago
Starred by Yaowei Zheng (Author of LLaMA-Factory), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 6 more.

gptq by IST-DASLab

Top 0.1% · 2k stars
Code for GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers
Created 2 years ago · Updated 1 year ago
Starred by Wing Lian (Founder of Axolotl AI) and Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems").

airllm by lyogavin

Top 0.1% · 6k stars
Inference optimization for LLMs on low-resource hardware
Created 2 years ago · Updated 2 weeks ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Georgios Konstantopoulos (CTO, General Partner at Paradigm), and 5 more.

GPTQ-for-LLaMa by qwopqwop200

Top 0.0% · 3k stars
4-bit quantization for LLaMA models using GPTQ
Created 2 years ago · Updated 1 year ago
Starred by Tobi Lutke (Cofounder of Shopify), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 22 more.

qlora by artidoro

Top 0.1% · 11k stars
Finetuning tool for quantized LLMs
Created 2 years ago · Updated 1 year ago