Triton kernel for GPTQ inference, improving context scaling
This repository provides a Triton kernel for GPTQ inference, aiming to improve performance scaling with context length compared to existing CUDA kernels. It's designed for researchers and engineers working with large language models who need efficient inference on quantized models.
How It Works
The implementation leverages Triton, a Python-based language and compiler for writing high-performance GPU kernels. Quantized weights are decoded on the fly inside the matrix-multiplication kernels rather than materialized in full precision, and the Feed-Forward Network (FFN) layers and Query-Key-Value (QKV) projections are fused into single kernels. This approach is intended to avoid the performance degradation that other GPTQ implementations exhibit at large context lengths; the decode step is sketched below.
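As a rough illustration of the decode step (not the repository's actual kernel), the sketch below unpacks 4-bit weights inside a Triton kernel. It assumes the common GPTQ convention of eight 4-bit values packed per int32 and, for simplicity, a single scale and zero point per output column; real GPTQ checkpoints typically use group-wise scales, and the repository's kernels fuse this unpacking into the matmul inner loop instead of writing decoded weights back to memory.

```python
# Minimal sketch of on-the-fly 4-bit weight decoding in Triton.
# Assumptions (not the repository's actual kernel): eight 4-bit values
# packed per int32 along K, one fp16 scale and one integer zero point
# per output column.
import torch
import triton
import triton.language as tl


@triton.jit
def dequant_4bit_kernel(qweight_ptr, scales_ptr, zeros_ptr, out_ptr,
                        K, N, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < K * N
    row = offs // N          # position along K in the decoded matrix
    col = offs % N           # output column
    # The int32 at (row // 8, col) packs rows 8*(row//8) .. 8*(row//8)+7.
    packed = tl.load(qweight_ptr + (row // 8) * N + col, mask=mask, other=0)
    nibble = (packed >> ((row % 8) * 4)) & 0xF
    scale = tl.load(scales_ptr + col, mask=mask, other=0.0)
    zero = tl.load(zeros_ptr + col, mask=mask, other=0)
    w = (nibble - zero).to(tl.float16) * scale
    tl.store(out_ptr + offs, w, mask=mask)


def dequantize_4bit(qweight, scales, zeros):
    """Decode (K // 8, N) packed int32 weights to a (K, N) fp16 matrix."""
    K, N = qweight.shape[0] * 8, qweight.shape[1]
    out = torch.empty(K, N, dtype=torch.float16, device=qweight.device)
    grid = lambda meta: (triton.cdiv(K * N, meta["BLOCK"]),)
    dequant_4bit_kernel[grid](qweight, scales, zeros, out, K, N, BLOCK=1024)
    return out
```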
Quick Start & Requirements
Install with `pip install .`. A compatible `transformers` library is required; pin to commit 28f26c107b4a1c5c7e32ed4d9575622da0627a40. A `quantize.py` script is available for preparing models; a sketch of the assumed packed weight layout follows.
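For illustration, the packed layout assumed in the kernel sketch above can be produced and round-trip-checked in plain PyTorch. The convention here (eight 4-bit values per int32) is an assumption; the exact format `quantize.py` emits may differ.

```python
# Illustrative packing into the int32 layout the kernel sketch above
# consumes (a common GPTQ convention; the format quantize.py actually
# emits may differ). Requires a CUDA GPU with Triton installed, and
# reuses dequantize_4bit from the kernel sketch above.
import torch


def pack_4bit(w_q: torch.Tensor) -> torch.Tensor:
    """Pack an int tensor of 4-bit values, shape (K, N), into (K // 8, N) int32."""
    K, N = w_q.shape
    assert K % 8 == 0
    packed = torch.zeros(K // 8, N, dtype=torch.int32, device=w_q.device)
    for i in range(8):
        packed |= (w_q[i::8].to(torch.int32) & 0xF) << (4 * i)
    return packed


# Round-trip check against a straightforward fp16 reference.
K, N = 128, 64
w_q = torch.randint(0, 16, (K, N), dtype=torch.int32, device="cuda")
scales = torch.rand(N, dtype=torch.float16, device="cuda")
zeros = torch.full((N,), 8, dtype=torch.int32, device="cuda")
reference = (w_q - zeros).half() * scales
decoded = dequantize_4bit(pack_4bit(w_q), scales, zeros)
assert torch.allclose(decoded, reference)
```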
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The `act-order` parameter's functionality is not fully explained in the README. Compatibility with specific `transformers` versions is noted as critical.