AutoGPTQ by AutoGPTQ

LLM quantization package using GPTQ algorithm

created 2 years ago
4,905 stars

Top 10.4% on sourcepulse

View on GitHub
Project Summary

This package provides an easy-to-use library for quantizing Large Language Models (LLMs) using the GPTQ algorithm, targeting researchers and developers who need to reduce model size and improve inference speed. It offers user-friendly APIs and integrates with popular frameworks like Hugging Face Transformers.
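
To illustrate the user-friendly API, the sketch below shows a typical quantize-and-save flow modeled on the library's documented quickstart; the model name, calibration sentence, and output directory are placeholders, not prescribed values.

```python
# Minimal AutoGPTQ quantization sketch (placeholder model and paths).
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "facebook/opt-125m"      # placeholder model
quantized_model_dir = "opt-125m-4bit-gptq"      # placeholder output directory

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)

# A tiny calibration set; real runs should use many more samples.
examples = [
    tokenizer("auto-gptq is an easy-to-use model quantization library based on the GPTQ algorithm.")
]

# 4-bit weight-only quantization with 128-column groups.
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)

model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)

model.save_quantized(quantized_model_dir)
tokenizer.save_pretrained(quantized_model_dir)
```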

How It Works

AutoGPTQ implements weight-only quantization, compressing model weights to 4-bit integers while largely preserving model quality. It leverages optimized kernels, including Marlin (for Ampere and newer GPUs) and ExLlamaV2, to accelerate inference. The library allows flexible configuration of quantization parameters such as group size and activation ordering (desc_act).
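
A minimal inference sketch follows, assuming a model quantized and saved as in the example above. The use_marlin flag is an assumption tied to recent releases and Ampere-or-newer GPUs; on other hardware or older releases the library falls back to kernels such as ExLlamaV2.

```python
# Load a previously quantized model and generate text (placeholder path).
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

quantized_model_dir = "opt-125m-4bit-gptq"      # placeholder from the sketch above
tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir)

model = AutoGPTQForCausalLM.from_quantized(
    quantized_model_dir,
    device="cuda:0",
    use_marlin=True,   # assumption: requires compute capability 8.0+ and a Marlin-enabled release
)

inputs = tokenizer("auto-gptq is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```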

Quick Start & Requirements

  • Installation: pip install auto-gptq (add --extra-index-url for specific CUDA/ROCm builds); install with Triton support via pip install auto-gptq[triton]. See the commands after this list.
  • Prerequisites: Linux or Windows, CUDA 11.8/12.1 or ROCm 5.7. NVIDIA Maxwell or lower GPUs are not supported.
  • Resources: Quantization and inference require significant GPU memory, depending on the model size.
  • Docs: https://github.com/AutoGPTQ/AutoGPTQ
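
The install commands referenced above, sketched out; the wheel-index placeholder must be filled in from the project README for your CUDA/ROCm build.

```bash
# Default build from PyPI
pip install auto-gptq

# Build matching a specific CUDA/ROCm version (placeholder index URL)
pip install auto-gptq --extra-index-url <wheel-index-for-your-cuda-or-rocm-build>

# Optional Triton kernel support
pip install auto-gptq[triton]
```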

Highlighted Details

  • Supports Marlin int4*fp16 kernel for faster inference on compatible GPUs (compute capability 8.0+).
  • Integrated with 🤗 Transformers, Optimum, and PEFT for broader usability (see the sketch after this list).
  • Offers built-in evaluation utilities for assessing quantized model performance on downstream tasks.
  • Supports quantization and inference for a wide range of model architectures including Llama, GPT-J, OPT, and Falcon.
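
As a sketch of the 🤗 Transformers integration mentioned above, the snippet below uses transformers' GPTQConfig, which drives GPTQ quantization via Optimum; the model name and calibration dataset are placeholders.

```python
# Quantize on load through the Transformers integration (placeholder model/dataset).
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"   # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)

gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=gptq_config,   # weights are quantized as the model loads
)
model.save_pretrained("opt-125m-gptq")
```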

Maintenance & Community

  • Status: The project is marked as unmaintained; the maintainers recommend GPTQModel for bug fixes and new model support.
  • Community: Links to Discord/Slack are not provided in the README.

Licensing & Compatibility

  • License: Apache 2.0.
  • Compatibility: Compatible with commercial use and closed-source linking.

Limitations & Caveats

The project is unmaintained, so active development, support for new models, and bug fixes are unlikely to continue. The README also notes that quantizing with too few calibration samples can reduce model quality.

Health Check
Last commit

3 months ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
0
Star History
86 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jeremy Howard (Cofounder of fast.ai), and 4 more.

llm-awq by mit-han-lab

0.4%
3k
Weight quantization research paper for LLM compression/acceleration
created 2 years ago
updated 2 weeks ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Georgios Konstantopoulos (CTO, General Partner at Paradigm), and 2 more.

GPTQ-for-LLaMa by qwopqwop200

0.0%
3k
4-bit quantization for LLaMA models using GPTQ
created 2 years ago
updated 1 year ago