Triton kernel for GPTQ inference, improving context scaling
This repository provides a Triton kernel for GPTQ inference, aiming to improve performance scaling with context length compared to existing CUDA kernels. It's designed for researchers and engineers working with large language models who need efficient inference on quantized models.
How It Works
The implementation leverages Triton, a Python-based language and compiler for writing high-performance GPU kernels. Quantized weights are decoded on the fly inside the matrix-multiplication kernels rather than materialized in full precision, and the Feed-Forward Network (FFN) layers and Query-Key-Value (QKV) projections are fused into single kernels. This approach is intended to avoid the performance degradation that other GPTQ implementations exhibit at large context lengths; the decode step is sketched below.
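As a rough illustration of the decode step (not the repository's actual kernel), the sketch below unpacks 4-bit weights inside a Triton kernel. It assumes the common GPTQ convention of eight 4-bit values packed per int32 and, for simplicity, a single scale and zero point per output column; real GPTQ checkpoints typically use group-wise scales, and the repository's kernels fuse this unpacking into the matmul inner loop instead of writing decoded weights back to memory.

```python
# Minimal sketch of on-the-fly 4-bit weight decoding in Triton.
# Assumptions (not the repository's actual kernel): eight 4-bit values
# packed per int32 along K, one fp16 scale and one integer zero point
# per output column.
import torch
import triton
import triton.language as tl


@triton.jit
def dequant_4bit_kernel(qweight_ptr, scales_ptr, zeros_ptr, out_ptr,
                        K, N, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < K * N
    row = offs // N          # position along K in the decoded matrix
    col = offs % N           # output column
    # The int32 at (row // 8, col) packs rows 8*(row//8) .. 8*(row//8)+7.
    packed = tl.load(qweight_ptr + (row // 8) * N + col, mask=mask, other=0)
    nibble = (packed >> ((row % 8) * 4)) & 0xF
    scale = tl.load(scales_ptr + col, mask=mask, other=0.0)
    zero = tl.load(zeros_ptr + col, mask=mask, other=0)
    w = (nibble - zero).to(tl.float16) * scale
    tl.store(out_ptr + offs, w, mask=mask)


def dequantize_4bit(qweight, scales, zeros):
    """Decode (K // 8, N) packed int32 weights to a (K, N) fp16 matrix."""
    K, N = qweight.shape[0] * 8, qweight.shape[1]
    out = torch.empty(K, N, dtype=torch.float16, device=qweight.device)
    grid = lambda meta: (triton.cdiv(K * N, meta["BLOCK"]),)
    dequant_4bit_kernel[grid](qweight, scales, zeros, out, K, N, BLOCK=1024)
    return out
```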
Quick Start & Requirements
Install with `pip install .`. A compatible `transformers` library is required; pin to commit 28f26c107b4a1c5c7e32ed4d9575622da0627a40. A `quantize.py` script is available for preparing models; a sketch of the assumed packed weight layout follows.
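For illustration, the packed layout assumed in the kernel sketch above can be produced and round-trip-checked in plain PyTorch. The convention here (eight 4-bit values per int32) is an assumption; the exact format `quantize.py` emits may differ.

```python
# Illustrative packing into the int32 layout the kernel sketch above
# consumes (a common GPTQ convention; the format quantize.py actually
# emits may differ). Requires a CUDA GPU with Triton installed, and
# reuses dequantize_4bit from the kernel sketch above.
import torch


def pack_4bit(w_q: torch.Tensor) -> torch.Tensor:
    """Pack an int tensor of 4-bit values, shape (K, N), into (K // 8, N) int32."""
    K, N = w_q.shape
    assert K % 8 == 0
    packed = torch.zeros(K // 8, N, dtype=torch.int32, device=w_q.device)
    for i in range(8):
        packed |= (w_q[i::8].to(torch.int32) & 0xF) << (4 * i)
    return packed


# Round-trip check against a straightforward fp16 reference.
K, N = 128, 64
w_q = torch.randint(0, 16, (K, N), dtype=torch.int32, device="cuda")
scales = torch.rand(N, dtype=torch.float16, device="cuda")
zeros = torch.full((N,), 8, dtype=torch.int32, device="cuda")
reference = (w_q - zeros).half() * scales
decoded = dequantize_4bit(pack_4bit(w_q), scales, zeros)
assert torch.allclose(decoded, reference)
```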
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The `act-order` parameter's functionality is not fully explained in the README. Compatibility with specific `transformers` versions is noted as critical.