GPTQModel by ModelCloud

LLM compression toolkit for accelerated CPU/GPU inference

created 1 year ago · 705 stars · Top 49.5% on sourcepulse

Project Summary

This toolkit provides production-ready GPTQ model compression and quantization, targeting efficient CPU/GPU inference via integrations with Hugging Face Transformers, vLLM, and SGLang. It's designed for researchers and engineers needing to deploy large language models with reduced memory footprints and improved inference speeds.

How It Works

GPTQModel implements GPTQ-based quantization, supporting various bit-widths and group sizes. It leverages optimized kernels like Marlin, Exllama v2, and Triton for accelerated inference. The toolkit offers flexible quantization methods, including Intel's AutoRound and QBits, and supports dynamic per-layer quantization for further memory reduction.
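
A minimal quantization sketch, assuming the toolkit's documented Python entry points (GPTQModel, QuantizeConfig); the model id, calibration texts, and output path are illustrative placeholders, and exact argument names may vary between releases:

    from gptqmodel import GPTQModel, QuantizeConfig

    # 4-bit weights with group size 128 is the common GPTQ configuration
    quant_config = QuantizeConfig(bits=4, group_size=128)

    # tiny placeholder calibration set; real runs typically use a few hundred samples
    calibration = [
        "GPTQ quantizes weights one layer at a time using calibration activations.",
        "Large language models can be compressed with minimal loss in perplexity.",
    ]

    model = GPTQModel.load("meta-llama/Llama-3.2-1B-Instruct", quant_config)
    model.quantize(calibration)  # runs GPTQ layer by layer over the calibration data
    model.save("Llama-3.2-1B-Instruct-gptq-4bit")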

Quick Start & Requirements

  • Install: pip install -v --no-build-isolation gptqmodel[auto_round,vllm,sglang,bitblas,qbits] (a minimal usage sketch follows this list)
  • Prerequisites: Linux with an Nvidia GPU of CUDA compute capability >= 6.0. Windows is supported via WSL. ROCm/AMD support is planned for future releases.
  • Resources: Quantization requires significant RAM, depending on model size. Inference is optimized for GPU.
  • Docs: https://github.com/ModelCloud/GPTQModel
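
The usage sketch referenced in the install step, assuming a prequantized checkpoint on the Hugging Face Hub; the repo id below is a placeholder, and the generate/tokenizer helpers may differ slightly by version:

    from gptqmodel import GPTQModel

    # load a prequantized GPTQ checkpoint onto the GPU
    model = GPTQModel.load("ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit")  # placeholder repo id

    # the loaded model exposes its tokenizer, so prompts can be passed as plain strings
    tokens = model.generate("Uncovering deep insights begins with")[0]
    print(model.tokenizer.decode(tokens))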

Highlighted Details

  • Supports over 30 LLM architectures, including recent models like IBM Granite, Llama 3.2 Vision, and DeepSeek-V2.
  • Offers 100% CI coverage for all supported models, including quality/PPL regression tests.
  • Integrates with vLLM and SGLang for optimized dynamic batching inference (backend selection sketched after this list).
  • Features Intel/AutoRound and Intel/QBits support for potentially higher quantization quality and CPU inference.
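
A sketch of the backend selection mentioned in the vLLM/SGLang bullet above; the BACKEND enum and backend= keyword reflect recent releases and should be checked against the installed version:

    from gptqmodel import BACKEND, GPTQModel

    # route inference through vLLM's dynamic-batching engine instead of the default kernels;
    # BACKEND.SGLANG selects the SGLang runtime in the same way
    model = GPTQModel.load(
        "ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit",  # placeholder repo id
        backend=BACKEND.VLLM,
    )
    # generation then proceeds as with the default backend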

Maintenance & Community

The project is actively maintained with frequent updates and new model support. Community engagement is encouraged via GitHub PRs for new model additions.

Licensing & Compatibility

The project is licensed under Apache 2.0, allowing for commercial use and integration with closed-source applications.

Limitations & Caveats

Currently, only Linux and Windows (via WSL) are officially supported platforms. ROCm/AMD GPU support is not yet available. Some newer models may have partial support for specific quantization layers.

Health Check

  • Last commit: 2 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 5
  • Issues (30d): 15
  • Star History: 201 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jaret Burkett (Founder of Ostris), and 1 more.

nunchaku by nunchaku-tech (2.1%, 3k stars)
High-performance 4-bit diffusion model inference engine
created 8 months ago, updated 15 hours ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Ying Sheng (Author of SGLang).

fastllm by ztxz16 (0.4%, 4k stars)
High-performance C++ LLM inference library
created 2 years ago, updated 2 weeks ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Georgios Konstantopoulos (CTO, General Partner at Paradigm), and 2 more.

GPTQ-for-LLaMa by qwopqwop200 (0.0%, 3k stars)
4-bit quantization for LLaMA models using GPTQ
created 2 years ago, updated 1 year ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 6 more.

AutoGPTQ by AutoGPTQ (0.1%, 5k stars)
LLM quantization package using GPTQ algorithm
created 2 years ago, updated 3 months ago