flute by HanGuo97

Engine for LUT-quantized LLMs

created 1 year ago
372 stars

Top 77.3% on sourcepulse

View on GitHub
Project Summary

FLUTE is a flexible engine for efficient matrix multiplications on Lookup Table (LUT)-quantized Large Language Models (LLMs). By supporting a range of low-bit quantization schemes, including uniform, NF4, and the novel Learned Normal Float (NFL), it delivers significant speedups and reduced memory footprints. It is aimed at researchers and practitioners who want to deploy LLMs with lower resource requirements and minimal accuracy loss.

How It Works

FLUTE uses custom CUDA kernels to accelerate matrix multiplications over LUT-quantized weights. Unlike uniform quantization, LUT quantization maps each quantized code to an arbitrary de-quantized value through a lookup table and a per-group scale factor, offering greater flexibility and potentially better accuracy. The engine supports 2-, 3-, and 4-bit widths and multiple group sizes, with broader compatibility under development.
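To make the lookup-table idea concrete, here is a minimal NumPy sketch of LUT de-quantization. The function name, array layout, and group handling are illustrative assumptions, not FLUTE's actual kernel interface.

```python
import numpy as np

def lut_dequantize(codes, lut, scales, group_size=128):
    # Hypothetical helper (not FLUTE's API): map each low-bit code to a
    # float via the lookup table, then apply a per-group scale factor.
    values = lut[codes]                          # codes -> table entries
    groups = values.reshape(-1, group_size)      # one scale per group
    return (groups * scales[:, None]).reshape(-1)

# Toy 4-bit example: 16 arbitrary (non-uniform) de-quantization levels.
rng = np.random.default_rng(0)
lut = np.sort(rng.standard_normal(16)).astype(np.float32)
codes = rng.integers(0, 16, size=256)
scales = rng.random(256 // 128).astype(np.float32)
weights = lut_dequantize(codes, lut, scales)
```

A uniform-quantization kernel only needs the scale multiply; the extra table lookup is what a LUT-specialized kernel like FLUTE's has to make fast.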

Quick Start & Requirements

  • Installation: pip install flute-kernel (specify CUDA version with -i https://flute-ai.github.io/whl/cuXXX if needed).
  • Prerequisites: CUDA 11.8, 12.1, or 12.4.
  • Integration: Supports vLLM for serving quantized models and Hugging Face Transformers (experimental); see the vLLM sketch after this list.
  • Documentation: Getting Started
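For serving, the snippet below is a minimal sketch of the vLLM path. The model path is a placeholder, and passing quantization="flute" to vLLM is an assumption based on the integration described above; check the Getting Started guide for the exact invocation.

```python
from vllm import LLM, SamplingParams

# Placeholder checkpoint path; "flute" as the quantization backend name
# is assumed from the README's vLLM integration, not verified here.
llm = LLM(model="path/to/flute-quantized-model", quantization="flute")

params = SamplingParams(temperature=0.0, max_tokens=32)
outputs = llm.generate(["What is LUT quantization?"], params)
print(outputs[0].outputs[0].text)
```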

Highlighted Details

  • Supports flexible LUT quantization (e.g., int4, fp4, nf4, nf3, nf2) and Learned Normal Float (NFL); a sketch of how such non-uniform tables are built follows this list.
  • Achieves competitive performance on LLaMA-3.1 and Gemma-2 models with minimal perplexity degradation.
  • Provides integrations for seamless deployment with vLLM and Hugging Face.
  • Includes pre-quantized models and tools for quantizing custom models.
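The "normal float" formats above place de-quantization levels at quantiles of a Gaussian rather than on a uniform grid. The sketch below shows the general idea only; the exact NF4 and NFL constructions used by FLUTE may differ (NFL additionally learns parameters of the table).

```python
import numpy as np
from scipy.stats import norm

def normal_float_table(bits=4):
    # Generic "normal float" construction (illustrative, not FLUTE's
    # exact recipe): place 2**bits levels at evenly spaced Gaussian
    # quantiles, then normalize so the table spans [-1, 1].
    n = 2 ** bits
    probs = (np.arange(n) + 0.5) / n   # avoid the 0 and 1 endpoints
    levels = norm.ppf(probs)
    return (levels / np.abs(levels).max()).astype(np.float32)

print(normal_float_table(4))   # 16 non-uniform levels for an nf4 table
```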

Maintenance & Community

The project is actively developed, with recent updates including support for LLaMA-3.1 (405B), Gemma-2, Hadamard Transform, and vector quantization. Community interaction channels are not explicitly mentioned in the README.

Licensing & Compatibility

The project is licensed under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

The CUDA kernel is shape-specialized: extending FLUTE to new model architectures or hardware requires manual kernel tuning and recompilation. Certain configurations (e.g., bits=4, group-size=256 on A100/RTX 4090 with bfloat16) have shown numerical instability or correctness issues and are not recommended.

Health Check

  • Last commit: 3 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 12 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Jeremy Howard (cofounder of fast.ai), and 4 more.

llm-awq by mit-han-lab

Top 0.4% · 3k stars
Weight quantization research paper for LLM compression/acceleration
created 2 years ago · updated 2 weeks ago