LLM compression toolkit for accelerated CPU/GPU inference
This toolkit provides production-ready GPTQ model compression and quantization, targeting efficient CPU/GPU inference via integrations with Hugging Face Transformers, vLLM, and SGLang. It's designed for researchers and engineers needing to deploy large language models with reduced memory footprints and improved inference speeds.
How It Works
GPTQModel implements GPTQ-based quantization, supporting a range of bit-widths and group sizes. It leverages optimized kernels such as Marlin, Exllama v2, and Triton for accelerated inference. The toolkit also supports alternative quantization flows, including Intel's AutoRound, Intel QBits kernels for CPU inference, and dynamic per-layer quantization settings for further memory reduction.
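As a rough illustration, the sketch below quantizes a Hugging Face model to 4-bit weights with a group size of 128 using GPTQModel's Python API. The model id, calibration slice size, and output path are placeholders, and the exact GPTQModel/QuantizeConfig signatures should be checked against the installed version.

from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig

# A few hundred calibration samples drive the GPTQ weight reconstruction.
calibration = load_dataset(
    "allenai/c4", data_files="en/c4-train.00001-of-01024.json.gz", split="train"
).select(range(512))["text"]

# 4-bit weights, one quantization scale per group of 128 columns.
quant_config = QuantizeConfig(bits=4, group_size=128)

model = GPTQModel.load("meta-llama/Llama-3.2-1B-Instruct", quant_config)  # placeholder model id
model.quantize(calibration, batch_size=1)
model.save("Llama-3.2-1B-Instruct-gptq-4bit")  # placeholder output path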
Quick Start & Requirements
pip install -v --no-build-isolation gptqmodel[auto_round,vllm,sglang,bitblas,qbits]
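Once installed, a pre-quantized checkpoint can be loaded and run directly. The following is a minimal sketch assuming GPTQModel's load/generate helpers; the quantized model id is a placeholder.

from gptqmodel import GPTQModel

# Load a pre-quantized GPTQ checkpoint (placeholder id) onto the available backend.
model = GPTQModel.load("ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit")

# Generate from a prompt and decode with the bundled tokenizer.
tokens = model.generate("GPTQ quantization works by")[0]
print(model.tokenizer.decode(tokens))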
Maintenance & Community
The project is actively maintained, with frequent releases and expanding model coverage. Contributions of support for new models are welcomed via GitHub pull requests.
Licensing & Compatibility
The project is licensed under Apache 2.0, allowing for commercial use and integration with closed-source applications.
Limitations & Caveats
Currently, only Linux and Windows (via WSL) are officially supported platforms. ROCm/AMD GPU support is not yet available. Some newer models may have partial support for specific quantization layers.