LLM quantization package using GPTQ algorithm
This package provides an easy-to-use library for quantizing Large Language Models (LLMs) using the GPTQ algorithm, targeting researchers and developers who need to reduce model size and improve inference speed. It offers user-friendly APIs and integrates with popular frameworks like Hugging Face Transformers.
How It Works
AutoGPTQ implements weight-only quantization, reducing weight precision to low-bit integers (typically 4-bit) while largely preserving model quality. It leverages optimized kernels, including Marlin (for Ampere GPUs) and ExLlamaV2, to accelerate inference. The library allows flexible configuration of quantization parameters such as group size and activation ordering (desc_act).
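The core workflow follows AutoGPTQ's quantization API; the sketch below is illustrative, with the model name, calibration text, and output directory chosen as placeholders.

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

pretrained_model_name = "facebook/opt-125m"   # placeholder model
quantized_model_dir = "opt-125m-4bit-128g"    # placeholder output directory

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name, use_fast=True)

# Calibration examples: each entry provides input_ids and attention_mask.
# A real run should use many more (and more representative) samples.
examples = [
    tokenizer("auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm.")
]

quantize_config = BaseQuantizeConfig(
    bits=4,          # quantize weights to 4-bit
    group_size=128,  # share quantization parameters per group of 128 weights
    desc_act=False,  # activation ordering; False is faster, True can be slightly more accurate
)

model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_name, quantize_config)
model.quantize(examples)                   # run GPTQ using the calibration examples
model.save_quantized(quantized_model_dir)  # write quantized weights and config
```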
Quick Start & Requirements
Install with pip install auto-gptq (add --extra-index-url for specific CUDA/ROCm builds). Triton support can be installed via pip install auto-gptq[triton].
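A saved quantized model can then be reloaded for accelerated inference; a minimal sketch, with the directory and prompt as placeholders:

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

quantized_model_dir = "opt-125m-4bit-128g"  # placeholder: directory produced by save_quantized

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")  # tokenizer of the original model
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0")

inputs = tokenizer("auto_gptq is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```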
Highlighted Details
Maintenance & Community
Users are directed to GPTQModel for bug fixes and new models.

Licensing & Compatibility
Limitations & Caveats
The project is unmaintained, so active development, bug fixes, and support for new models should not be expected. The README notes that quantizing with too few calibration samples can reduce model quality.
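For context, a hedged sketch of assembling a larger calibration set (the dataset, sample count, and sequence length here are illustrative assumptions, not recommendations from the README):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")  # placeholder model
data = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

# Keep a few hundred non-empty passages and tokenize them as calibration examples.
texts = [t for t in data["text"] if t.strip()][:256]
examples = [tokenizer(t, truncation=True, max_length=512) for t in texts]
# `examples` can then be passed to model.quantize(examples) as in the earlier sketch.
```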