AutoGPTQ by AutoGPTQ

LLM quantization package using GPTQ algorithm

created 2 years ago
4,905 stars

Top 10.4% on sourcepulse

View on GitHub
Project Summary

This package provides an easy-to-use library for quantizing Large Language Models (LLMs) using the GPTQ algorithm, targeting researchers and developers who need to reduce model size and improve inference speed. It offers user-friendly APIs and integrates with popular frameworks like Hugging Face Transformers.
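
To illustrate the user-friendly API, the sketch below shows a typical quantize-and-save flow modeled on the library's documented quickstart; the model name, calibration sentence, and output directory are placeholders, not prescribed values.

```python
# Minimal AutoGPTQ quantization sketch (placeholder model and paths).
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "facebook/opt-125m"      # placeholder model
quantized_model_dir = "opt-125m-4bit-gptq"      # placeholder output directory

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)

# A tiny calibration set; real runs should use many more samples.
examples = [
    tokenizer("auto-gptq is an easy-to-use model quantization library based on the GPTQ algorithm.")
]

# 4-bit weight-only quantization with 128-column groups.
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)

model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)

model.save_quantized(quantized_model_dir)
tokenizer.save_pretrained(quantized_model_dir)
```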

How It Works

AutoGPTQ implements weight-only quantization, compressing model weights to 4-bit integers while largely preserving model quality. It leverages optimized kernels, including Marlin (for Ampere and newer GPUs) and ExLlamaV2, to accelerate inference. The library allows flexible configuration of quantization parameters such as group size and activation ordering (desc_act).
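
A minimal inference sketch follows, assuming a model quantized and saved as in the example above. The use_marlin flag is an assumption tied to recent releases and Ampere-or-newer GPUs; on other hardware or older releases the library falls back to kernels such as ExLlamaV2.

```python
# Load a previously quantized model and generate text (placeholder path).
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

quantized_model_dir = "opt-125m-4bit-gptq"      # placeholder from the sketch above
tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir)

model = AutoGPTQForCausalLM.from_quantized(
    quantized_model_dir,
    device="cuda:0",
    use_marlin=True,   # assumption: requires compute capability 8.0+ and a Marlin-enabled release
)

inputs = tokenizer("auto-gptq is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```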

Quick Start & Requirements

  • Installation: pip install auto-gptq (add --extra-index-url for specific CUDA/ROCm builds); install with Triton support via pip install auto-gptq[triton]. See the commands after this list.
  • Prerequisites: Linux or Windows, CUDA 11.8/12.1 or ROCm 5.7. NVIDIA Maxwell or lower GPUs are not supported.
  • Resources: Quantization and inference require significant GPU memory, depending on the model size.
  • Docs: https://github.com/AutoGPTQ/AutoGPTQ
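
The install commands referenced above, sketched out; the wheel-index placeholder must be filled in from the project README for your CUDA/ROCm build.

```bash
# Default build from PyPI
pip install auto-gptq

# Build matching a specific CUDA/ROCm version (placeholder index URL)
pip install auto-gptq --extra-index-url <wheel-index-for-your-cuda-or-rocm-build>

# Optional Triton kernel support
pip install auto-gptq[triton]
```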

Highlighted Details

  • Supports Marlin int4*fp16 kernel for faster inference on compatible GPUs (compute capability 8.0+).
  • Integrated with 🤗 Transformers, Optimum, and PEFT for broader usability (see the sketch after this list).
  • Offers built-in evaluation utilities for assessing quantized model performance on downstream tasks.
  • Supports quantization and inference for a wide range of model architectures including Llama, GPT-J, OPT, and Falcon.
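
As a sketch of the 🤗 Transformers integration mentioned above, the snippet below uses transformers' GPTQConfig, which drives GPTQ quantization via Optimum; the model name and calibration dataset are placeholders.

```python
# Quantize on load through the Transformers integration (placeholder model/dataset).
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"   # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)

gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=gptq_config,   # weights are quantized as the model loads
)
model.save_pretrained("opt-125m-gptq")
```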

Maintenance & Community

  • Status: The project is marked as unmaintained; the maintainers recommend GPTQModel for bug fixes and new model support.
  • Community: Links to Discord/Slack are not provided in the README.

Licensing & Compatibility

  • License: Apache 2.0.
  • Compatibility: Compatible with commercial use and closed-source linking.

Limitations & Caveats

The project is unmaintained, so active development, support for new models, and bug fixes are unlikely to continue. The README also notes that quantizing with too few calibration samples can reduce model quality.

Health Check
Last commit

3 months ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
0
Star History
86 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jeremy Howard (Cofounder of fast.ai), and 4 more.

llm-awq by mit-han-lab

0.4%
3k
Weight quantization research paper for LLM compression/acceleration
created 2 years ago
updated 2 weeks ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Georgios Konstantopoulos (CTO, General Partner at Paradigm), and 2 more.

GPTQ-for-LLaMa by qwopqwop200

0.0%
3k
4-bit quantization for LLaMA models using GPTQ
created 2 years ago
updated 1 year ago