gemlite by mobiusml

Triton kernels for efficient low-bit matrix multiplication

Created 1 year ago
369 stars

Top 76.5% on SourcePulse

Project Summary

GemLite provides a collection of Triton kernels for efficient low-bit matrix multiplication, targeting developers and researchers working with large language models and other deep learning applications. It offers significant speedups for prefill and decoding operations by optimizing weight quantization and computation, aiming to simplify the integration of high-performance kernels.

How It Works

GemLite leverages Triton, a Python-based language for writing high-performance GPU kernels, to implement several matrix-multiplication strategies: GEMV, GEMM, GEMM Split-K, and a novel GEMV RevSplit-K. This approach gives flexibility in hardware-specific optimization, supports bit-packed weights at 8, 4, 2, and 1 bits, and accepts multiple activation precisions (FP16, BF16, FP8, INT8). The kernels are designed to maximize performance across different matrix shapes and batch sizes, with features like autotune caching to speed up kernel selection.
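The bit-packing idea can be illustrated with a minimal, library-independent sketch. The function names below are illustrative, not GemLite's API: it shows two 4-bit weights packed into each 8-bit storage element, the same principle GemLite applies when packing low-bit weights into wider words on the GPU.

```python
def pack_4bit(weights):
    """Pack pairs of 4-bit values (0..15) into single bytes.

    Illustrative sketch only: GemLite's kernels pack low-bit weights
    into 32-bit words on the GPU, but the principle is the same.
    """
    assert all(0 <= w < 16 for w in weights)
    assert len(weights) % 2 == 0
    # Low nibble holds the first value, high nibble the second.
    return bytes(weights[i] | (weights[i + 1] << 4)
                 for i in range(0, len(weights), 2))

def unpack_4bit(packed):
    """Recover the original 4-bit values from packed bytes."""
    out = []
    for b in packed:
        out.append(b & 0x0F)         # low nibble
        out.append((b >> 4) & 0x0F)  # high nibble
    return out

w = [3, 12, 0, 15, 7, 7]
packed = pack_4bit(w)          # 3 bytes instead of 6
assert unpack_4bit(packed) == w
```

Packing halves (for 4-bit) or further shrinks (for 2-/1-bit) the memory traffic per weight, which is where the decoding speedups largely come from.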

Quick Start & Requirements

  • Installation: pip install gemlite
  • Prerequisites: CUDA-enabled GPU.
  • Usage: Import gemlite and GemLiteLinear for custom kernel instantiation, or use helper functions like A16W8 for pre-configured quantization.
  • Resources: Official Documentation
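A helper name like A16W8 denotes 16-bit activations with 8-bit weights. The weight-quantization step such a configuration relies on can be sketched in plain Python. This is a conceptual illustration with hypothetical function names, not GemLite's internal code: weights are split into groups of group_size values, and each group gets its own floating-point scale (group-wise scaling).

```python
def quantize_groupwise(weights, group_size=32, n_bits=8):
    """Symmetric group-wise quantization: each group of `group_size`
    values shares one FP scale; values are stored as signed ints.

    Conceptual sketch only -- not GemLite's actual implementation.
    """
    assert len(weights) % group_size == 0
    qmax = 2 ** (n_bits - 1) - 1          # e.g. 127 for 8-bit
    q, scales = [], []
    for g in range(0, len(weights), group_size):
        group = weights[g:g + group_size]
        scale = max(abs(v) for v in group) / qmax or 1.0
        scales.append(scale)
        q.extend(max(-qmax, min(qmax, round(v / scale))) for v in group)
    return q, scales

def dequantize_groupwise(q, scales, group_size=32):
    """Recover approximate FP weights from ints and per-group scales."""
    return [q[i] * scales[i // group_size] for i in range(len(q))]

w = [0.5 * ((-1) ** i) * (i % 7) for i in range(64)]  # toy weights
q, s = quantize_groupwise(w)
w_hat = dequantize_groupwise(q, s)
assert max(abs(a - b) for a, b in zip(w, w_hat)) < 0.05
```

Smaller group sizes track the weight distribution more closely at the cost of storing more scales, which is why the minimum group_size of 32 (noted under Limitations below) matters for accuracy/overhead trade-offs.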

Highlighted Details

  • Achieves up to 7-8x faster prefill and 3-6x faster decoding compared to default Torch AO kernels.
  • Supports bfloat16 precision and integration with vLLM via the hqq library.
  • Features flexible bitpacking (8-bit for A100/H100) and channel-wise scaling.
  • Includes torch.compile() support and autotune caching for faster startup.

Maintenance & Community

  • Developed by Mobius Labs.
  • Actively under development with contributions welcome.
  • Twitter

Licensing & Compatibility

  • The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

  • All kernels require a minimum group_size of 32.
  • The GEMV RevSplit-K kernel has compatibility issues with 1-bit weights packed as 32-bit with group_size=32.
  • bfloat16 performance may be slightly slower for small batch sizes due to a fallback to FP32 atomic addition.
Health Check

  • Last Commit: 4 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 2
  • Issues (30d): 1
  • Star History: 20 stars in the last 30 days

Explore Similar Projects

Starred by Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), Yaowei Zheng (Author of LLaMA-Factory), and 4 more.

ml-cross-entropy by apple

0.4%
520
PyTorch module for memory-efficient cross-entropy in LLMs
Created 10 months ago
Updated 1 day ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Jeremy Howard (Cofounder of fast.ai).

GPTFast by MDK8888

0%
687
HF Transformers accelerator for faster inference
Created 1 year ago
Updated 1 year ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Pawel Garbacki (Cofounder of Fireworks AI), and 11 more.

Liger-Kernel by linkedin

0.6%
6k
Triton kernels for efficient LLM training
Created 1 year ago
Updated 1 day ago