GPTQModel by ModelCloud

LLM compression toolkit for accelerated CPU/GPU inference

Created 1 year ago
784 stars

Top 44.7% on SourcePulse

View on GitHub
Project Summary

This toolkit provides production-ready GPTQ model compression and quantization, targeting efficient CPU/GPU inference via integrations with Hugging Face Transformers, vLLM, and SGLang. It's designed for researchers and engineers needing to deploy large language models with reduced memory footprints and improved inference speeds.

How It Works

GPTQModel implements GPTQ-based quantization, supporting various bit-widths and group sizes. It leverages optimized kernels like Marlin, Exllama v2, and Triton for accelerated inference. The toolkit offers flexible quantization methods, including Intel's AutoRound and QBits, and supports dynamic per-layer quantization for further memory reduction.
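As a rough illustration of that configuration surface, the sketch below sets a 4-bit, group-size-128 recipe and a per-layer override map. The regex keys, the "+:"/"-:" prefixes, and the exact dynamic-override syntax are assumptions based on the project docs and may differ between releases.

    from gptqmodel import QuantizeConfig

    # Base recipe: 4-bit weights with per-group scales over 128 weights.
    quant_config = QuantizeConfig(bits=4, group_size=128)

    # Assumed dynamic per-layer overrides: regex keys select modules;
    # a "-:" prefix (if supported by your version) skips quantizing a match.
    dynamic_config = QuantizeConfig(
        bits=4,
        group_size=128,
        dynamic={
            r"+:.*\.mlp\..*": {"bits": 8, "group_size": 64},  # keep MLP modules at higher precision
            r"-:.*lm_head.*": {},                             # leave the output head unquantized
        },
    )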

Quick Start & Requirements

  • Install: pip install -v --no-build-isolation gptqmodel[auto_round,vllm,sglang,bitblas,qbits] (see the usage sketch after this list)
  • Prerequisites: Linux and an NVIDIA GPU with CUDA compute capability >= 6.0. Windows is supported via WSL. ROCm/AMD support is planned for future releases.
  • Resources: Quantization requires significant RAM, depending on model size. Inference is optimized for GPU.
  • Docs: https://github.com/ModelCloud/GPTQModel
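The following is a minimal quantize-save-reload sketch following the general pattern in the project README. The model ID, calibration slice, and output path are placeholders, and the exact method names and arguments (GPTQModel.load, quantize, save) may differ between releases.

    from datasets import load_dataset
    from gptqmodel import GPTQModel, QuantizeConfig

    # Placeholder model and output path for illustration only.
    model_id = "meta-llama/Llama-3.2-1B-Instruct"
    quant_path = "Llama-3.2-1B-Instruct-gptq-4bit"

    # Small calibration set; larger, task-relevant text generally gives better PPL.
    calibration = load_dataset(
        "allenai/c4",
        data_files="en/c4-train.00001-of-01024.json.gz",
        split="train",
    ).select(range(1024))["text"]

    quant_config = QuantizeConfig(bits=4, group_size=128)

    model = GPTQModel.load(model_id, quant_config)   # load fp16 weights + quantization config
    model.quantize(calibration, batch_size=2)        # run the GPTQ calibration pass
    model.save(quant_path)                           # write the quantized checkpoint

    # Reload the quantized checkpoint and generate.
    model = GPTQModel.load(quant_path)
    tokens = model.generate("Quantization reduces memory by")[0]
    print(model.tokenizer.decode(tokens))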

Highlighted Details

  • Supports over 30 LLM architectures, including recent models like IBM Granite, Llama 3.2 Vision, and DeepSeek-V2.
  • Offers 100% CI coverage for all supported models, including quality/PPL regression tests.
  • Integrates with vLLM and SGLang for optimized dynamic-batching inference (see the serving sketch after this list).
  • Features Intel/AutoRound and Intel/QBits support for potentially higher quantization quality and CPU inference.
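For example, a GPTQ-quantized checkpoint like the one saved above can typically be served through vLLM's offline API. This is a sketch under the assumption that the checkpoint path and sampling settings are placeholders; it uses vLLM's generic GPTQ loading rather than any GPTQModel-specific integration.

    from vllm import LLM, SamplingParams

    # Assumed path to a GPTQ-quantized checkpoint saved earlier.
    llm = LLM(model="Llama-3.2-1B-Instruct-gptq-4bit", quantization="gptq")

    params = SamplingParams(temperature=0.7, max_tokens=64)
    outputs = llm.generate(["Quantization reduces memory by"], params)
    for out in outputs:
        print(out.outputs[0].text)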

Maintenance & Community

The project is actively maintained with frequent updates and new model support. Community engagement is encouraged via GitHub PRs for new model additions.

Licensing & Compatibility

The project is licensed under Apache 2.0, allowing for commercial use and integration with closed-source applications.

Limitations & Caveats

Currently, only Linux and Windows (via WSL) are officially supported platforms. ROCm/AMD GPU support is not yet available. Some newer models may have partial support for specific quantization layers.

Health Check
  • Last Commit: 13 hours ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 109
  • Issues (30d): 26
  • Star History: 49 stars in the last 30 days

Explore Similar Projects

Starred by Lysandre Debut (Chief Open-Source Officer at Hugging Face), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 4 more.

AQLM by Vahe1994

Top 0.4% · 1k stars
PyTorch code for LLM compression via Additive Quantization (AQLM)
Created 1 year ago · Updated 1 month ago
Starred by Yaowei Zheng (Author of LLaMA-Factory), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 7 more.

llm-awq by mit-han-lab

Top 0.3% · 3k stars
Weight quantization research paper for LLM compression/acceleration
Created 2 years ago · Updated 2 months ago
Starred by Wing Lian (Founder of Axolotl AI) and Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems").

airllm by lyogavin

Top 0.1% · 6k stars
Inference optimization for LLMs on low-resource hardware
Created 2 years ago · Updated 2 weeks ago