Triton kernels for efficient low-bit matrix multiplication
GemLite is a collection of Triton kernels for efficient low-bit matrix multiplication, aimed at developers and researchers working with large language models and other deep learning applications. It offers significant speedups for both prefill and decoding by optimizing weight quantization and computation, and it aims to simplify the integration of high-performance kernels.
How It Works
GemLite uses Triton, a Python-based language for writing high-performance GPU kernels, to implement several matrix multiplication strategies: GEMV, GEMM, GEMM Split-K, and a novel GEMV RevSplit-K. This approach allows flexible hardware optimization and bit-packing (8-, 4-, 2-, and 1-bit weights) and supports multiple activation precisions (FP16, BF16, FP8, INT8). The kernels are designed to maximize performance across different matrix shapes and batch sizes, and autotune caching accelerates kernel selection.
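To make the bit-packing and Split-K ideas concrete, here is a minimal CPU sketch in plain Python. It is not GemLite's implementation: the function names (pack_bits, unpack_bits, split_k_gemv) are hypothetical, and the packing layout (lowest bits first within each byte) is an assumption chosen for illustration. It only shows the general techniques: several low-bit weights share one byte, and the reduction axis K is cut into chunks whose partial results are summed.

```python
def pack_bits(values, nbits):
    """Pack unsigned nbits-wide integers into bytes, lowest bits first.

    Layout is an illustrative assumption, not GemLite's actual format.
    """
    if nbits not in (1, 2, 4, 8):
        raise ValueError("nbits must be 1, 2, 4, or 8")
    per_byte = 8 // nbits
    if len(values) % per_byte != 0:
        raise ValueError("number of values must fill whole bytes")
    limit = 1 << nbits
    packed = bytearray()
    for i in range(0, len(values), per_byte):
        byte = 0
        for j, v in enumerate(values[i:i + per_byte]):
            if not 0 <= v < limit:
                raise ValueError("value does not fit in nbits")
            byte |= v << (j * nbits)
        packed.append(byte)
    return bytes(packed)


def unpack_bits(packed, nbits):
    """Inverse of pack_bits: recover the original list of integers."""
    per_byte = 8 // nbits
    mask = (1 << nbits) - 1
    return [(byte >> (j * nbits)) & mask
            for byte in packed
            for j in range(per_byte)]


def split_k_gemv(weight_rows, x, split_k):
    """Compute y = W @ x with the reduction axis K cut into split_k chunks.

    Each chunk yields a partial output vector; the partials are summed at
    the end. On a GPU the chunks would typically run in parallel.
    """
    k = len(x)
    chunk = -(-k // split_k)  # ceiling division
    partials = []
    for s in range(split_k):
        lo, hi = s * chunk, min((s + 1) * chunk, k)
        partials.append([sum(row[i] * x[i] for i in range(lo, hi))
                         for row in weight_rows])
    return [sum(p[r] for p in partials) for r in range(len(weight_rows))]


# Two rows of 4-bit weights, stored packed and unpacked before the product.
rows = [[3, 15, 0, 7], [9, 1, 4, 2]]
packed_rows = [pack_bits(r, 4) for r in rows]
assert all(len(p) == 2 for p in packed_rows)  # 4 weights fit in 2 bytes

x = [1, 2, 3, 4]
unpacked = [unpack_bits(p, 4) for p in packed_rows]
reference = [sum(w * xi for w, xi in zip(r, x)) for r in rows]
assert split_k_gemv(unpacked, x, split_k=2) == reference
```

Splitting along K mainly helps when the output is small (few rows, small batch) and the reduction is long, since it exposes more independent work than a single pass over K would.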
Quick Start & Requirements
Install with pip install gemlite
Use gemlite and GemLiteLinear for custom kernel instantiation, or helper functions like A16W8 for pre-configured quantization.
Highlighted Details
torch.compile() support and autotune caching for faster startup.
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
group_size of 32.
group_size=32
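For readers unfamiliar with the group_size parameter, the sketch below illustrates group-wise quantization in plain Python, assuming the conventional meaning: group_size is the number of consecutive weights that share one scale and one offset. The function names and the min/max (asymmetric) scheme are hypothetical choices for illustration and are not taken from GemLite.

```python
def quantize_groupwise(weights, nbits=4, group_size=32):
    """Quantize floats to unsigned nbits integers, one scale/offset per group."""
    if len(weights) % group_size != 0:
        raise ValueError("len(weights) must be a multiple of group_size")
    qmax = (1 << nbits) - 1
    q, scales, offsets = [], [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        w_min, w_max = min(group), max(group)
        scale = (w_max - w_min) / qmax
        if scale == 0.0:
            scale = 1.0  # constant group: any nonzero scale works
        scales.append(scale)
        offsets.append(w_min)
        for w in group:
            level = round((w - w_min) / scale)
            q.append(min(qmax, max(0, level)))  # clamp to the valid range
    return q, scales, offsets


def dequantize_groupwise(q, scales, offsets, group_size=32):
    """Map quantized levels back to approximate float weights."""
    return [q[i] * scales[i // group_size] + offsets[i // group_size]
            for i in range(len(q))]


weights = [i / 10 for i in range(64)]          # two groups of 32
q, scales, offsets = quantize_groupwise(weights, nbits=4, group_size=32)
restored = dequantize_groupwise(q, scales, offsets, group_size=32)

assert len(scales) == len(offsets) == 2        # one pair per group
assert all(0 <= level <= 15 for level in q)    # fits in 4 bits
# Rounding error is at most half a quantization step within each group.
assert all(abs(w - r) <= scales[i // 32] / 2 + 1e-9
           for i, (w, r) in enumerate(zip(weights, restored)))
```

Smaller groups track the local weight range more closely but store more scales and offsets per weight, which is the usual trade-off behind a group_size constraint.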