FLUTE is a flexible engine for efficient matrix multiplications tailored for Lookup Table (LUT) quantized Large Language Models (LLMs). It enables significant speedups and reduced memory footprints for LLMs by supporting various low-bit quantization schemes, including uniform, NF4, and the novel Learned Normal Float (NFL). This project is beneficial for researchers and practitioners looking to deploy LLMs with reduced resource requirements without substantial performance degradation.
How It Works
FLUTE leverages custom CUDA kernels to accelerate matrix multiplications for LUT-quantized weights. Unlike uniform quantization, LUT quantization maps quantized values to arbitrary de-quantized values via a lookup table and a scale factor, offering greater flexibility and potentially better accuracy. The project supports various bit-widths (2, 3, 4-bit) and group sizes, with ongoing development for broader compatibility.
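As a rough sketch of the idea (plain NumPy, not FLUTE's actual CUDA kernel or API; the table values, codes, and scales below are invented for illustration):

import numpy as np

# A toy 2-bit lookup table (4 entries); FLUTE supports 2-, 3-, and 4-bit tables.
# The entries can be arbitrary values (e.g., normal-float levels), which is what
# distinguishes LUT quantization from uniform quantization.
lut = np.array([-1.0, -0.33, 0.33, 1.0], dtype=np.float32)

codes = np.array([[0, 3, 1, 2],            # stored low-bit indices for a 2x4 weight
                  [2, 0, 3, 1]], dtype=np.int8)
scales = np.array([[0.05], [0.12]], dtype=np.float32)  # one scale per group (here, per row)

w = lut[codes] * scales                    # de-quantize: table lookup, then scale
x = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)
y = w @ x                                  # the matmul FLUTE accelerates with custom CUDA kernels
print(y)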
Quick Start & Requirements
pip install flute-kernel
(Specify the CUDA version with -i https://flute-ai.github.io/whl/cuXXX if needed.)
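For example, to target CUDA 12.1 (assuming a cu121 wheel index is published under that URL pattern):
pip install flute-kernel -i https://flute-ai.github.io/whl/cu121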
Highlighted Details
Supported data types include int4, fp4, nf4, nf3, nf2, and Learned Normal Float (NFL).
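For intuition on the normal-float formats, here is a simplified standard-library sketch of one common way such levels are derived, as quantiles of a Gaussian; FLUTE's actual tables, and the levels NFL learns, are not reproduced here and may differ.

from statistics import NormalDist

# Simplified construction of nf-style levels: evenly spaced Gaussian quantiles,
# normalized to [-1, 1]. Illustrative only.
def normal_float_levels(bits: int) -> list[float]:
    n = 2 ** bits
    probs = [(i + 0.5) / n for i in range(n)]            # evenly spaced probabilities in (0, 1)
    levels = [NormalDist().inv_cdf(p) for p in probs]    # standard-normal quantiles
    max_abs = max(abs(v) for v in levels)
    return [v / max_abs for v in levels]                 # normalize into [-1, 1]

print(normal_float_levels(4))   # 16 candidate levels for an nf4-style lookup table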
Maintenance & Community
The project is actively developed, with recent updates including support for LLaMA-3.1 (405B), Gemma-2, the Hadamard Transform, and vector quantization. Community interaction channels are not explicitly mentioned in the README.
Licensing & Compatibility
The project is licensed under the MIT License, permitting commercial use and integration with closed-source projects.
Limitations & Caveats
The CUDA kernel is currently shape-specialized, so extending to new model architectures or hardware configurations requires manual kernel tuning and recompilation. Some configurations (e.g., bits=4, group-size=256 on A100/RTX 4090 with bfloat16) have shown numerical instability or correctness issues and are not recommended.