flute by HanGuo97

Engine for LUT-quantized LLMs

created 1 year ago
372 stars

Top 77.3% on sourcepulse

View on GitHub
Project Summary

FLUTE is a flexible engine for efficient matrix multiplications on Lookup Table (LUT)-quantized Large Language Models (LLMs). By supporting a range of low-bit quantization schemes, including uniform, NF4, and the novel Learned Normal Float (NFL), it delivers significant speedups and reduced memory footprints. It is aimed at researchers and practitioners who want to deploy LLMs with lower resource requirements and minimal accuracy loss.

How It Works

FLUTE uses custom CUDA kernels to accelerate matrix multiplications over LUT-quantized weights. Unlike uniform quantization, LUT quantization maps each quantized code to an arbitrary de-quantized value through a lookup table and a per-group scale factor, offering greater flexibility and potentially better accuracy. The engine supports 2-, 3-, and 4-bit widths and multiple group sizes, with broader compatibility under development.
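To make the lookup-table idea concrete, here is a minimal NumPy sketch of LUT de-quantization. The function name, array layout, and group handling are illustrative assumptions, not FLUTE's actual kernel interface.

```python
import numpy as np

def lut_dequantize(codes, lut, scales, group_size=128):
    # Hypothetical helper (not FLUTE's API): map each low-bit code to a
    # float via the lookup table, then apply a per-group scale factor.
    values = lut[codes]                          # codes -> table entries
    groups = values.reshape(-1, group_size)      # one scale per group
    return (groups * scales[:, None]).reshape(-1)

# Toy 4-bit example: 16 arbitrary (non-uniform) de-quantization levels.
rng = np.random.default_rng(0)
lut = np.sort(rng.standard_normal(16)).astype(np.float32)
codes = rng.integers(0, 16, size=256)
scales = rng.random(256 // 128).astype(np.float32)
weights = lut_dequantize(codes, lut, scales)
```

A uniform-quantization kernel only needs the scale multiply; the extra table lookup is what a LUT-specialized kernel like FLUTE's has to make fast.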

Quick Start & Requirements

  • Installation: pip install flute-kernel (specify CUDA version with -i https://flute-ai.github.io/whl/cuXXX if needed).
  • Prerequisites: CUDA 11.8, 12.1, or 12.4.
  • Integration: Supports vLLM for serving quantized models and Hugging Face Transformers (experimental); see the vLLM sketch after this list.
  • Documentation: Getting Started
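For serving, the snippet below is a minimal sketch of the vLLM path. The model path is a placeholder, and passing quantization="flute" to vLLM is an assumption based on the integration described above; check the Getting Started guide for the exact invocation.

```python
from vllm import LLM, SamplingParams

# Placeholder checkpoint path; "flute" as the quantization backend name
# is assumed from the README's vLLM integration, not verified here.
llm = LLM(model="path/to/flute-quantized-model", quantization="flute")

params = SamplingParams(temperature=0.0, max_tokens=32)
outputs = llm.generate(["What is LUT quantization?"], params)
print(outputs[0].outputs[0].text)
```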

Highlighted Details

  • Supports flexible LUT quantization (e.g., int4, fp4, nf4, nf3, nf2) and Learned Normal Float (NFL); a sketch of how such non-uniform tables are built follows this list.
  • Achieves competitive performance on LLaMA-3.1 and Gemma-2 models with minimal perplexity degradation.
  • Provides integrations for seamless deployment with vLLM and Hugging Face.
  • Includes pre-quantized models and tools for quantizing custom models.
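The "normal float" formats above place de-quantization levels at quantiles of a Gaussian rather than on a uniform grid. The sketch below shows the general idea only; the exact NF4 and NFL constructions used by FLUTE may differ (NFL additionally learns parameters of the table).

```python
import numpy as np
from scipy.stats import norm

def normal_float_table(bits=4):
    # Generic "normal float" construction (illustrative, not FLUTE's
    # exact recipe): place 2**bits levels at evenly spaced Gaussian
    # quantiles, then normalize so the table spans [-1, 1].
    n = 2 ** bits
    probs = (np.arange(n) + 0.5) / n   # avoid the 0 and 1 endpoints
    levels = norm.ppf(probs)
    return (levels / np.abs(levels).max()).astype(np.float32)

print(normal_float_table(4))   # 16 non-uniform levels for an nf4 table
```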

Maintenance & Community

The project is actively developed, with recent updates including support for LLaMA-3.1 (405B), Gemma-2, Hadamard Transform, and vector quantization. Community interaction channels are not explicitly mentioned in the README.

Licensing & Compatibility

The project is licensed under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

The CUDA kernel is shape-specialized: extending FLUTE to new model architectures or hardware requires manual kernel tuning and recompilation. Certain configurations (e.g., bits=4, group-size=256 on A100/RTX 4090 with bfloat16) have shown numerical instability or correctness issues and are not recommended.

Health Check

  • Last commit: 3 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 12 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Jeremy Howard (cofounder of fast.ai), and 4 more.

llm-awq by mit-han-lab

Top 0.4% · 3k stars
Weight quantization research paper for LLM compression/acceleration
created 2 years ago · updated 2 weeks ago