gemlite by mobiusml

Triton kernels for efficient low-bit matrix multiplication

created 1 year ago
338 stars

Top 82.6% on sourcepulse

Project Summary

GemLite provides a collection of Triton kernels for efficient low-bit matrix multiplication, targeting developers and researchers working with large language models and other deep learning workloads. It delivers significant speedups for prefill and decoding by pairing low-bit weight quantization with optimized compute kernels, while aiming to keep high-performance kernels simple to integrate.

How It Works

GemLite leverages Triton, a Python-based language for writing high-performance GPU kernels, to implement several matrix-multiplication strategies: GEMV, GEMM, GEMM Split-K, and a novel GEMV RevSplit-K. This approach allows flexible, hardware-specific optimization, bit-packed weights at 8, 4, 2, and 1 bits, and multiple activation precisions (FP16, BF16, FP8, INT8). The kernels are designed to maximize performance across different matrix shapes and batch sizes, and autotune caching accelerates kernel selection at startup.
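
To make the bit-packing idea concrete, here is a small plain-PyTorch illustration of fitting two 4-bit weight values into each byte. This is only a sketch of the concept, not GemLite's actual packing layout, which is defined inside its Triton kernels:

```python
import torch

def pack_4bit(w_q: torch.Tensor) -> torch.Tensor:
    """Pack pairs of 4-bit values (0..15) along the last dim into uint8."""
    assert w_q.shape[-1] % 2 == 0
    lo = w_q[..., 0::2]
    hi = w_q[..., 1::2]
    return (lo | (hi << 4)).to(torch.uint8)

def unpack_4bit(packed: torch.Tensor) -> torch.Tensor:
    """Recover the original 4-bit values from the packed uint8 tensor."""
    lo = packed & 0x0F
    hi = (packed >> 4) & 0x0F
    return torch.stack((lo, hi), dim=-1).flatten(-2)

w_q = torch.randint(0, 16, (4096, 4096), dtype=torch.uint8)  # toy quantized weights
packed = pack_4bit(w_q)  # half the memory footprint of the uint8 representation
assert torch.equal(unpack_4bit(packed), w_q)
```

The same idea extends to 2-bit (four values per byte) and 1-bit (eight values per byte) packing.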

Quick Start & Requirements

  • Installation: pip install gemlite
  • Prerequisites: CUDA-enabled GPU.
  • Usage: Import gemlite and GemLiteLinear for custom kernel instantiation, or use helper functions like A16W8 for pre-configured quantization (see the sketch after this list).
  • Resources: Official Documentation
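
A minimal sketch of the helper path from the usage bullet above, assuming the A16W8 helper exposes a from_linear conversion as the summary describes; treat the exact signature as an assumption to verify against the installed version:

```python
import torch
import torch.nn as nn
from gemlite.helper import A16W8  # pre-configured FP16-activation / INT8-weight path

# Convert an existing half-precision Linear layer instead of configuring a
# custom GemLiteLinear by hand (the lower-level path mentioned above).
layer = nn.Linear(4096, 4096, bias=False, dtype=torch.float16, device="cuda")
qlayer = A16W8(device="cuda:0").from_linear(layer)  # assumed helper signature

x = torch.randn(1, 4096, dtype=torch.float16, device="cuda")
y = qlayer(x)  # dispatches to the Triton kernel chosen for this shape/batch size
```

The lower-level GemLiteLinear class covers the same ground when a custom bit-width, group size, or activation precision is needed.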

Highlighted Details

  • Achieves up to 7-8x faster prefill and 3-6x faster decoding compared to default Torch AO kernels.
  • Supports bfloat16 precision and integrates with vLLM via the hqq library.
  • Features flexible bit-packing (8-bit packing on A100/H100) and channel-wise scaling.
  • Includes torch.compile() support and autotune caching for faster startup (see the sketch after this list).
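
A sketch of how autotune caching might be wired around a compiled model. The cache_config / load_config names are assumptions based on the caching feature listed above; verify them against the gemlite version you install:

```python
import os
import torch
import torch.nn as nn
import gemlite

# Reuse previously saved autotune results so kernel selection is fast at startup.
# cache_config / load_config are assumed names for the caching feature above.
if os.path.exists("gemlite_config.json"):
    gemlite.load_config("gemlite_config.json")

# Stand-in model; in practice its Linear layers would be swapped for GemLite layers.
model = nn.Sequential(nn.Linear(4096, 4096, dtype=torch.float16, device="cuda"))
model = torch.compile(model)  # torch.compile() support per the list above

x = torch.randn(1, 4096, dtype=torch.float16, device="cuda")
with torch.no_grad():
    _ = model(x)  # warm-up run triggers autotuning for unseen shapes

gemlite.cache_config("gemlite_config.json")  # persist autotune results (assumed API)
```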

Maintenance & Community

  • Developed by Mobius Labs.
  • Actively under development; contributions welcome.
  • Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Jaret Burkett (founder of Ostris), and 1 more.
  • Twitter

Licensing & Compatibility

  • The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

  • All kernels require a minimum group_size of 32.
  • The GEMV RevSplit-K kernel has compatibility issues with 1-bit weights packed as 32-bit with group_size=32.
  • bfloat16 performance may be slightly slower at small batch sizes due to a fallback to FP32 atomic additions.

Health Check

  • Last commit: 5 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 45 stars in the last 90 days
