gemlite by mobiusml

Triton kernels for efficient low-bit matrix multiplication

Created 1 year ago
390 stars

Top 73.5% on SourcePulse

View on GitHub
Project Summary

GemLite provides a collection of Triton kernels for efficient low-bit matrix multiplication, targeting developers and researchers working with large language models and other deep learning applications. It offers significant speedups for prefill and decoding operations by optimizing weight quantization and computation, with the goal of making high-performance kernels simple to integrate.

How It Works

GemLite leverages Triton, a Python-based language for writing high-performance GPU kernels, to implement several matrix-multiplication strategies: GEMV, GEMM, GEMM Split-K, and a novel GEMV RevSplit-K. The kernels support bit-packed weights at 8, 4, 2, and 1 bits and multiple activation precisions (FP16, BF16, FP8, INT8), giving flexibility to optimize for different hardware. They are designed to maximize performance across matrix shapes and batch sizes, and autotune caching speeds up kernel selection on subsequent runs.
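
As a rough illustration of how these options surface in code, the sketch below instantiates a single low-bit layer through GemLiteLinear. The import path, constructor arguments, and hqq-style packing call follow GemLite's documented examples, but exact names and tensor layouts may differ between releases, so treat this as an assumption rather than the definitive API.

    # Sketch of GemLite's core layer (assumed API based on the project's examples;
    # names, dtypes, and packing layout may differ across releases).
    import torch
    from gemlite.core import GemLiteLinear, DType  # assumed import path

    in_features, out_features = 4096, 4096
    W_nbits, group_size = 4, 128          # weights: 8/4/2/1-bit; group_size must be >= 32

    # Placeholder hqq-style quantized weights, scales, and zeros (shapes are illustrative).
    W_q    = torch.randint(0, 2**W_nbits, (out_features, in_features), dtype=torch.uint8, device="cuda")
    scales = torch.ones(out_features, in_features // group_size, dtype=torch.float16, device="cuda")
    zeros  = torch.zeros(out_features, in_features // group_size, dtype=torch.float16, device="cuda")

    layer = GemLiteLinear(
        W_nbits,
        group_size=group_size,
        in_features=in_features,
        out_features=out_features,
        input_dtype=DType.FP16,           # activations: FP16 / BF16 / FP8 / INT8
        output_dtype=DType.FP16,
    )
    layer.pack(W_q, scales, zeros, None)  # no bias; packing follows the hqq format

    x = torch.randn(1, in_features, dtype=torch.float16, device="cuda")
    out = layer(x)                        # dispatches to GEMV/GEMM/Split-K kernels by shape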

Quick Start & Requirements

  • Installation: pip install gemlite
  • Prerequisites: CUDA-enabled GPU.
  • Usage: Import gemlite and GemLiteLinear for custom kernel instantiation, or use helpers like A16W8 for pre-configured quantization (see the sketch after this list).
  • Resources: Official Documentation
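
For the helper path mentioned in the Usage bullet, a minimal sketch could look like the following. The A16W8 name comes from the bullet above, but its module path and from_linear signature are assumptions here, so check the project documentation for the exact API.

    # Sketch of the pre-configured helper path (A16W8 location and from_linear are assumed).
    import torch
    from gemlite.helper import A16W8      # assumed: FP16 activations / 8-bit weights

    # A plain half-precision Linear layer to convert.
    linear = torch.nn.Linear(4096, 4096, bias=False, dtype=torch.float16, device="cuda")

    # Quantize, pack, and wrap the layer so its forward pass runs on GemLite kernels.
    gemlite_linear = A16W8(device="cuda:0").from_linear(linear)

    x = torch.randn(1, 4096, dtype=torch.float16, device="cuda")
    out = gemlite_linear(x)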

Highlighted Details

  • Achieves up to 7-8x faster prefill and 3-6x faster decoding compared to default Torch AO kernels.
  • Supports bfloat16 precision and integrates with vLLM via the hqq library.
  • Features flexible bit-packing (8-bit packing for A100/H100) and channel-wise scaling.
  • Includes torch.compile() support and autotune caching for faster startup (see the sketch after this list).
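
The last bullet can be combined roughly as below. torch.compile is standard PyTorch, while the cache_config/load_config calls follow the config-caching pattern described for GemLite and should be treated as assumed names.

    # Sketch: compile the model and persist GemLite's autotune results
    # (cache_config / load_config are assumed method names).
    import torch
    from gemlite.core import GemLiteLinear  # assumed import path

    # Any module whose Linear layers were swapped for GemLite layers (see sketches above).
    model = torch.nn.Sequential(torch.nn.Linear(4096, 4096, dtype=torch.float16, device="cuda"))
    model = torch.compile(model)            # GemLite kernels are torch.compile-compatible

    x = torch.randn(1, 4096, dtype=torch.float16, device="cuda")
    _ = model(x)                            # warm-up run triggers Triton autotuning

    # Save tuned configs so a new process can skip autotuning at startup.
    GemLiteLinear.cache_config("gemlite_config.json")
    # ...later, in a fresh process:
    # GemLiteLinear.load_config("gemlite_config.json")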

Maintenance & Community

  • Developed by Mobius Labs.
  • Actively under development with contributions welcome.
  • Twitter

Licensing & Compatibility

  • The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

  • All kernels require a minimum group_size of 32.
  • The GEMV RevSplit-K kernel has compatibility issues with 1-bit weights packed as 32-bit with group_size=32.
  • bfloat16 performance may be slightly lower at small batch sizes because atomic additions fall back to FP32.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 5
  • Issues (30d): 4

Star History

  • 16 stars in the last 30 days

Explore Similar Projects

Starred by Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), Yaowei Zheng (Author of LLaMA-Factory), and 4 more.

ml-cross-entropy by apple
0.2% · 546 stars
PyTorch module for memory-efficient cross-entropy in LLMs
Created 11 months ago · Updated 1 month ago

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Maxime Labonne (Head of Post-Training at Liquid AI), and 1 more.

GPTFast by MDK8888
0% · 685 stars
HF Transformers accelerator for faster inference
Created 1 year ago · Updated 1 year ago

Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Pawel Garbacki (Cofounder of Fireworks AI), and 11 more.

Liger-Kernel by linkedin
0.4% · 6k stars
Triton kernels for efficient LLM training
Created 1 year ago · Updated 9 hours ago

Starred by Nathan Lambert (Research Scientist at AI2), Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), and 7 more.

DeepGEMM by deepseek-ai
0.3% · 6k stars
CUDA library for efficient FP8 GEMM kernels with fine-grained scaling
Created 8 months ago · Updated 2 weeks ago