GPTQModel by ModelCloud

LLM compression toolkit for accelerated CPU/GPU inference

created 1 year ago · 705 stars · Top 49.5% on sourcepulse

Project Summary

This toolkit provides production-ready GPTQ model compression and quantization, targeting efficient CPU/GPU inference via integrations with Hugging Face Transformers, vLLM, and SGLang. It's designed for researchers and engineers needing to deploy large language models with reduced memory footprints and improved inference speeds.

How It Works

GPTQModel implements GPTQ-based quantization, supporting various bit-widths and group sizes. It leverages optimized kernels like Marlin, Exllama v2, and Triton for accelerated inference. The toolkit offers flexible quantization methods, including Intel's AutoRound and QBits, and supports dynamic per-layer quantization for further memory reduction.
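
A minimal quantization sketch, assuming the toolkit's documented Python entry points (GPTQModel, QuantizeConfig); the model id, calibration texts, and output path are illustrative placeholders, and exact argument names may vary between releases:

    from gptqmodel import GPTQModel, QuantizeConfig

    # 4-bit weights with group size 128 is the common GPTQ configuration
    quant_config = QuantizeConfig(bits=4, group_size=128)

    # tiny placeholder calibration set; real runs typically use a few hundred samples
    calibration = [
        "GPTQ quantizes weights one layer at a time using calibration activations.",
        "Large language models can be compressed with minimal loss in perplexity.",
    ]

    model = GPTQModel.load("meta-llama/Llama-3.2-1B-Instruct", quant_config)
    model.quantize(calibration)  # runs GPTQ layer by layer over the calibration data
    model.save("Llama-3.2-1B-Instruct-gptq-4bit")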

Quick Start & Requirements

  • Install: pip install -v --no-build-isolation gptqmodel[auto_round,vllm,sglang,bitblas,qbits] (a minimal usage sketch follows this list)
  • Prerequisites: Linux with an Nvidia GPU of CUDA compute capability >= 6.0. Windows is supported via WSL. ROCm/AMD support is planned for future releases.
  • Resources: Quantization requires significant RAM, depending on model size. Inference is optimized for GPU.
  • Docs: https://github.com/ModelCloud/GPTQModel
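
The usage sketch referenced in the install step, assuming a prequantized checkpoint on the Hugging Face Hub; the repo id below is a placeholder, and the generate/tokenizer helpers may differ slightly by version:

    from gptqmodel import GPTQModel

    # load a prequantized GPTQ checkpoint onto the GPU
    model = GPTQModel.load("ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit")  # placeholder repo id

    # the loaded model exposes its tokenizer, so prompts can be passed as plain strings
    tokens = model.generate("Uncovering deep insights begins with")[0]
    print(model.tokenizer.decode(tokens))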

Highlighted Details

  • Supports over 30 LLM architectures, including recent models like IBM Granite, Llama 3.2 Vision, and DeepSeek-V2.
  • Offers 100% CI coverage for all supported models, including quality/PPL regression tests.
  • Integrates with vLLM and SGLang for optimized dynamic batching inference (backend selection sketched after this list).
  • Features Intel/AutoRound and Intel/QBits support for potentially higher quantization quality and CPU inference.
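
A sketch of the backend selection mentioned in the vLLM/SGLang bullet above; the BACKEND enum and backend= keyword reflect recent releases and should be checked against the installed version:

    from gptqmodel import BACKEND, GPTQModel

    # route inference through vLLM's dynamic-batching engine instead of the default kernels;
    # BACKEND.SGLANG selects the SGLang runtime in the same way
    model = GPTQModel.load(
        "ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit",  # placeholder repo id
        backend=BACKEND.VLLM,
    )
    # generation then proceeds as with the default backend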

Maintenance & Community

The project is actively maintained with frequent updates and new model support. Community engagement is encouraged via GitHub PRs for new model additions.

Licensing & Compatibility

The project is licensed under Apache 2.0, allowing for commercial use and integration with closed-source applications.

Limitations & Caveats

Currently, only Linux and Windows (via WSL) are officially supported platforms. ROCm/AMD GPU support is not yet available. Some newer models may have partial support for specific quantization layers.

Health Check

  • Last commit: 2 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 5
  • Issues (30d): 15
  • Star History: 201 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jaret Burkett (Founder of Ostris), and 1 more.

nunchaku by nunchaku-tech (2.1%, 3k stars)
High-performance 4-bit diffusion model inference engine
created 8 months ago, updated 15 hours ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Ying Sheng (Author of SGLang).

fastllm by ztxz16 (0.4%, 4k stars)
High-performance C++ LLM inference library
created 2 years ago, updated 2 weeks ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Georgios Konstantopoulos (CTO, General Partner at Paradigm), and 2 more.

GPTQ-for-LLaMa by qwopqwop200 (0.0%, 3k stars)
4-bit quantization for LLaMA models using GPTQ
created 2 years ago, updated 1 year ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 6 more.

AutoGPTQ by AutoGPTQ (0.1%, 5k stars)
LLM quantization package using GPTQ algorithm
created 2 years ago, updated 3 months ago