gemlite by mobiusml

Triton kernels for efficient low-bit matrix multiplication

Created 1 year ago
390 stars

Top 73.5% on SourcePulse

View on GitHub
Project Summary

GemLite provides a collection of Triton kernels for efficient low-bit matrix multiplication, targeting developers and researchers working with large language models and other deep learning applications. It offers significant speedups for prefill and decoding operations by optimizing weight quantization and computation, with the goal of making high-performance kernels simple to integrate.

How It Works

GemLite leverages Triton, a Python-based language for writing high-performance GPU kernels, to implement several matrix-multiplication strategies: GEMV, GEMM, GEMM Split-K, and a novel GEMV RevSplit-K. The kernels support bit-packed weights at 8, 4, 2, and 1 bits and multiple activation precisions (FP16, BF16, FP8, INT8), giving flexibility to optimize for different hardware. They are designed to maximize performance across matrix shapes and batch sizes, and autotune caching speeds up kernel selection on subsequent runs.
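
As a rough illustration of how these options surface in code, the sketch below instantiates a single low-bit layer through GemLiteLinear. The import path, constructor arguments, and hqq-style packing call follow GemLite's documented examples, but exact names and tensor layouts may differ between releases, so treat this as an assumption rather than the definitive API.

    # Sketch of GemLite's core layer (assumed API based on the project's examples;
    # names, dtypes, and packing layout may differ across releases).
    import torch
    from gemlite.core import GemLiteLinear, DType  # assumed import path

    in_features, out_features = 4096, 4096
    W_nbits, group_size = 4, 128          # weights: 8/4/2/1-bit; group_size must be >= 32

    # Placeholder hqq-style quantized weights, scales, and zeros (shapes are illustrative).
    W_q    = torch.randint(0, 2**W_nbits, (out_features, in_features), dtype=torch.uint8, device="cuda")
    scales = torch.ones(out_features, in_features // group_size, dtype=torch.float16, device="cuda")
    zeros  = torch.zeros(out_features, in_features // group_size, dtype=torch.float16, device="cuda")

    layer = GemLiteLinear(
        W_nbits,
        group_size=group_size,
        in_features=in_features,
        out_features=out_features,
        input_dtype=DType.FP16,           # activations: FP16 / BF16 / FP8 / INT8
        output_dtype=DType.FP16,
    )
    layer.pack(W_q, scales, zeros, None)  # no bias; packing follows the hqq format

    x = torch.randn(1, in_features, dtype=torch.float16, device="cuda")
    out = layer(x)                        # dispatches to GEMV/GEMM/Split-K kernels by shape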

Quick Start & Requirements

  • Installation: pip install gemlite
  • Prerequisites: CUDA-enabled GPU.
  • Usage: Import gemlite and GemLiteLinear for custom kernel instantiation, or use helpers like A16W8 for pre-configured quantization (see the sketch after this list).
  • Resources: Official Documentation
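
For the helper path mentioned in the Usage bullet, a minimal sketch could look like the following. The A16W8 name comes from the bullet above, but its module path and from_linear signature are assumptions here, so check the project documentation for the exact API.

    # Sketch of the pre-configured helper path (A16W8 location and from_linear are assumed).
    import torch
    from gemlite.helper import A16W8      # assumed: FP16 activations / 8-bit weights

    # A plain half-precision Linear layer to convert.
    linear = torch.nn.Linear(4096, 4096, bias=False, dtype=torch.float16, device="cuda")

    # Quantize, pack, and wrap the layer so its forward pass runs on GemLite kernels.
    gemlite_linear = A16W8(device="cuda:0").from_linear(linear)

    x = torch.randn(1, 4096, dtype=torch.float16, device="cuda")
    out = gemlite_linear(x)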

Highlighted Details

  • Achieves up to 7-8x faster prefill and 3-6x faster decoding compared to default Torch AO kernels.
  • Supports bfloat16 precision and integrates with vLLM via the hqq library.
  • Features flexible bit-packing (8-bit packing for A100/H100) and channel-wise scaling.
  • Includes torch.compile() support and autotune caching for faster startup (see the sketch after this list).
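
The last bullet can be combined roughly as below. torch.compile is standard PyTorch, while the cache_config/load_config calls follow the config-caching pattern described for GemLite and should be treated as assumed names.

    # Sketch: compile the model and persist GemLite's autotune results
    # (cache_config / load_config are assumed method names).
    import torch
    from gemlite.core import GemLiteLinear  # assumed import path

    # Any module whose Linear layers were swapped for GemLite layers (see sketches above).
    model = torch.nn.Sequential(torch.nn.Linear(4096, 4096, dtype=torch.float16, device="cuda"))
    model = torch.compile(model)            # GemLite kernels are torch.compile-compatible

    x = torch.randn(1, 4096, dtype=torch.float16, device="cuda")
    _ = model(x)                            # warm-up run triggers Triton autotuning

    # Save tuned configs so a new process can skip autotuning at startup.
    GemLiteLinear.cache_config("gemlite_config.json")
    # ...later, in a fresh process:
    # GemLiteLinear.load_config("gemlite_config.json")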

Maintenance & Community

  • Developed by Mobius Labs.
  • Actively under development with contributions welcome.
  • Twitter

Licensing & Compatibility

  • The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

  • All kernels require a minimum group_size of 32.
  • The GEMV RevSplit-K kernel has compatibility issues with 1-bit weights packed as 32-bit with group_size=32.
  • bfloat16 performance may be slightly lower at small batch sizes because atomic additions fall back to FP32.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 5
  • Issues (30d): 4

Star History

  • 16 stars in the last 30 days

Explore Similar Projects

Starred by Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), Yaowei Zheng (Author of LLaMA-Factory), and 4 more.

ml-cross-entropy by apple
0.2% · 546 stars
PyTorch module for memory-efficient cross-entropy in LLMs
Created 11 months ago · Updated 1 month ago

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Maxime Labonne (Head of Post-Training at Liquid AI), and 1 more.

GPTFast by MDK8888
0% · 685 stars
HF Transformers accelerator for faster inference
Created 1 year ago · Updated 1 year ago

Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Pawel Garbacki (Cofounder of Fireworks AI), and 11 more.

Liger-Kernel by linkedin
0.4% · 6k stars
Triton kernels for efficient LLM training
Created 1 year ago · Updated 9 hours ago

Starred by Nathan Lambert (Research Scientist at AI2), Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), and 7 more.

DeepGEMM by deepseek-ai
0.3% · 6k stars
CUDA library for efficient FP8 GEMM kernels with fine-grained scaling
Created 8 months ago · Updated 2 weeks ago