GPTQ-triton by fpgaminer

Triton kernel for GPTQ inference, improving context scaling

created 2 years ago
303 stars

Top 89.1% on sourcepulse

Project Summary

This repository provides a Triton kernel for GPTQ inference, aiming to improve performance scaling with context length compared to existing CUDA kernels. It's designed for researchers and engineers working with large language models who need efficient inference on quantized models.

How It Works

The implementation is written in Triton, a language for authoring high-performance GPU kernels. The kernel decodes the quantized weights on the fly immediately before the matrix multiplication, and the feed-forward network (FFN) layers and query-key-value (QKV) projections are fused into single operations. This design is intended to avoid the performance degradation that other GPTQ implementations show as context length grows.
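
To make the on-the-fly decode concrete, below is a minimal Triton sketch (not the repository's actual kernel) that unpacks 4-bit weights stored eight per int32 word and rescales them with per-group scales and zero points. The real kernel fuses this decode directly into the matrix multiplication; the tensor names and packing layout here are illustrative assumptions.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def dequant4_kernel(qweight_ptr, scales_ptr, zeros_ptr, out_ptr,
                    n_elements, group_size, BLOCK: tl.constexpr):
    # Each program unpacks BLOCK 4-bit weights into fp16.
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)   # flat indices of unpacked weights
    mask = offs < n_elements

    packed = tl.load(qweight_ptr + offs // 8, mask=mask, other=0)  # int32 words, 8 nibbles each
    q = (packed >> ((offs % 8) * 4)) & 0xF     # extract the 4-bit quantized value
    q = q.to(tl.float16)

    g = offs // group_size                     # quantization group of this weight
    scale = tl.load(scales_ptr + g, mask=mask, other=1.0)
    zero = tl.load(zeros_ptr + g, mask=mask, other=0.0)

    tl.store(out_ptr + offs, (q - zero) * scale, mask=mask)


def dequant4(qweight, scales, zeros, n_elements, group_size=128):
    # qweight: int32 tensor with 8 packed 4-bit values per element
    # scales / zeros: fp16 tensors with one entry per quantization group
    out = torch.empty(n_elements, device=qweight.device, dtype=torch.float16)
    grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK"]),)
    dequant4_kernel[grid](qweight, scales, zeros, out,
                          n_elements, group_size, BLOCK=1024)
    return out
```

Because the decode happens inside the kernel, a full FP16 weight matrix is never materialized in GPU memory, which is where the memory savings over plain FP16 inference come from.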

Quick Start & Requirements

  • Install via pip install .
  • Requires a nightly build of the transformers library, or commit 28f26c107b4a1c5c7e32ed4d9575622da0627a40 (a setup sketch follows this list).
  • Quantization script quantize.py is available for preparing models.
  • Benchmarking and perplexity scripts are included.
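
As a hedged setup sketch (exact commands may differ from the repository's README), the pinned transformers commit can be installed straight from GitHub before installing this repository from a local checkout:

```python
# Setup sketch: pin the transformers commit listed above, then install this
# repository from inside the GPTQ-triton checkout.
import subprocess

subprocess.run(
    ["pip", "install",
     "git+https://github.com/huggingface/transformers@"
     "28f26c107b4a1c5c7e32ed4d9575622da0627a40"],
    check=True,
)
subprocess.run(["pip", "install", "."], check=True)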

Highlighted Details

  • Achieves 1.70 it/s for LLaMA-7B (4-bit, group-size -1) on an RTX 3090, far ahead of the GPTQ CUDA kernel (0.11 it/s) and on par with FP16 (1.64 it/s), while using significantly less memory (6323 MiB vs 17373 MiB).
  • Demonstrates competitive perplexity scores across Wikitext2, PTB, and C4 datasets compared to FP16 and GPTQ CUDA.
  • Supports group-size quantization for a trade-off between accuracy and memory (see the sketch after this list).
  • Includes scripts for benchmarking, perplexity calculation, and model quantization.
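
As a rough illustration of the group-size trade-off noted above, the back-of-the-envelope arithmetic below (not the repository's own accounting, and covering weight storage only, not activations or the KV cache) shows how smaller groups add one scale and one zero point per group in exchange for better accuracy. The parameter count and fp16 metadata sizes are assumptions.

```python
# Back-of-the-envelope weight-storage estimate for 4-bit quantization with grouping.
PARAMS = 6.7e9                        # approximate LLaMA-7B parameter count
BYTES_4BIT = PARAMS * 4 / 8           # 0.5 bytes per parameter

for group_size in (None, 128, 64):    # None ~ "group-size -1" (one group per output column)
    if group_size is None:
        overhead = 0.0                # negligible: one scale/zero per column
    else:
        overhead = (PARAMS / group_size) * 4   # 2-byte scale + 2-byte zero per group
    print(f"group_size={group_size}: ~{(BYTES_4BIT + overhead) / 2**30:.2f} GiB of weights")
```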

Maintenance & Community

  • The project is a personal effort by the author. No community channels or roadmap are mentioned.

Licensing & Compatibility

  • The README does not explicitly state a license. The code is based on GPTQ-for-LLaMa and GPTQ, which carry different licenses (Apache 2.0 and MIT, respectively), so clarification is needed before commercial use.

Limitations & Caveats

  • The Triton kernel is currently only implemented for 4-bit quantization.
  • The act-order parameter's functionality is not fully explained in the README.
  • Compatibility with specific transformers versions is noted as critical.
Health Check

  • Last commit: 2 years ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History
6 stars in the last 90 days

Explore Similar Projects

Starred by Lysandre Debut (Chief Open-Source Officer at Hugging Face), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 1 more.

AQLM by Vahe1994

0.1% · 1k stars
PyTorch code for LLM compression via Additive Quantization (AQLM)
created 1 year ago, updated 2 months ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jeremy Howard (Cofounder of fast.ai), and 4 more.

llm-awq by mit-han-lab

0.4% · 3k stars
Weight quantization research paper for LLM compression/acceleration
created 2 years ago, updated 2 weeks ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Georgios Konstantopoulos (CTO, General Partner at Paradigm), and 2 more.

GPTQ-for-LLaMa by qwopqwop200

0.0% · 3k stars
4-bit quantization for LLaMA models using GPTQ
created 2 years ago, updated 1 year ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 6 more.

AutoGPTQ by AutoGPTQ

0.1% · 5k stars
LLM quantization package using GPTQ algorithm
created 2 years ago, updated 3 months ago