fp6_llm by usyd-fsalab

GPU-accelerated LLM inference via quantization

created 1 year ago
260 stars

Top 98.2% on sourcepulse

Project Summary

This project provides an efficient GPU implementation for LLM inference using six-bit (FP6) quantization, targeting researchers and engineers seeking to reduce model size and inference costs while preserving accuracy. It offers significant speedups and memory reduction compared to FP16 and INT8 baselines.

How It Works

The core innovation is the TC-FPx system design, which adds Tensor Core support for low-bit floating-point weights of varying widths. SIMT cores dequantize the x-bit weights to FP16 at runtime before they are fed to the Tensor Cores for matrix multiplication; ahead-of-time bit-level pre-packing optimizes memory access for the irregular bit-widths, and a SIMT-efficient runtime minimizes dequantization overhead.
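As a rough illustration of that dequantization step (not the project's actual CUDA kernel), the sketch below decodes FP6 e3m2 codes to FP16 in PyTorch, assuming 1 sign bit, 3 exponent bits with a bias of 3, 2 mantissa bits, and a per-channel FP16 scale; the bit layout and the helper name are hypothetical.

```python
import torch

def dequant_fp6_e3m2(codes: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Illustrative FP6 (e3m2) -> FP16 decode; bit layout and bias are assumptions.

    codes: uint8 tensor holding one 6-bit value in the low bits of each byte.
    scale: per-output-channel FP16 quantization scale.
    """
    sign = 1.0 - 2.0 * ((codes >> 5) & 0b1).float()   # bit 5: sign
    exp = ((codes >> 2) & 0b111).float()              # bits 2-4: exponent
    man = (codes & 0b11).float()                      # bits 0-1: mantissa
    normal = torch.pow(2.0, exp - 3.0) * (1.0 + man / 4.0)
    subnormal = 0.25 * (man / 4.0)                    # exp == 0 case: 2^(1-bias) = 2^-2
    vals = sign * torch.where(exp > 0, normal, subnormal)
    return (vals * scale).to(torch.float16)
```

The real kernels perform this on SIMT cores against weights that were bit-packed ahead of time, so the unpacked one-value-per-byte layout above exists only for illustration.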

Quick Start & Requirements

  • Install via pip install .
  • Requires PyTorch and CUDA.
  • C++ API requires compilation via make.
  • Tested on NVIDIA A100 GPUs; H100 and GH200 compatibility is expected.
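Since the kernels were validated on A100 (sm_80) while newer parts are only expected to work, a quick device check in plain PyTorch (not part of the package) may be useful before installing:

```python
import torch

# Report the compute capability of the current GPU. fp6_llm's kernels were
# validated on A100 (sm_80); H100/GH200 (sm_90) support is expected but less tested.
assert torch.cuda.is_available(), "a CUDA-capable GPU is required"
major, minor = torch.cuda.get_device_capability()
print(f"Detected {torch.cuda.get_device_name()} (sm_{major}{minor})")
if (major, minor) != (8, 0):
    print("Note: only A100 (sm_80) has been thoroughly tested by the authors.")
```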

Highlighted Details

  • Achieves up to 8.9x speedup over bitsandbytes and 2.6x over FP16 baselines on linear layers.
  • Near-lossless model compression with FP6, outperforming INT4 in quality.
  • End-to-end inference on LLaMA-70B shows 1.69x-2.65x higher throughput while requiring fewer GPUs (see the rough memory estimate after this list).
  • Supports FP6_e3m2 and FP5_e2m2 weights with FP16 activations.
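The "fewer GPUs" point follows directly from the weight footprint. A back-of-the-envelope estimate (parameter count rounded to 70e9; activations, KV cache, and quantization-scale overhead ignored):

```python
# Rough weight-memory comparison for LLaMA-70B (~70e9 parameters).
params = 70e9
fp16_gb = params * 16 / 8 / 1e9   # 2.00 bytes per weight -> ~140 GB
fp6_gb  = params * 6 / 8 / 1e9    # 0.75 bytes per weight -> ~52.5 GB
print(f"FP16 weights: {fp16_gb:.0f} GB, FP6 weights: {fp6_gb:.1f} GB")
# ~140 GB of FP16 weights needs more than one 80 GB A100; ~52.5 GB fits on one.
```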

Maintenance & Community

  • Project name recently changed back to QuantLLM.
  • Paper accepted at USENIX ATC '24.
  • Integrated into DeepSpeed.
  • Welcomes collaborations and community contributions.

Licensing & Compatibility

  • No explicit license mentioned in the README.
  • Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

Currently, FP6-LLM is primarily tested on A100 GPUs, and while other Tensor Core GPUs are expected to be compatible, further verification may be needed. The README mentions future support for FP4 and INT5, but these are not yet implemented.

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 0
  • Star History: 12 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jaret Burkett (Founder of Ostris), and 1 more.

nunchaku by nunchaku-tech

2.1% · 3k stars
High-performance 4-bit diffusion model inference engine
created 8 months ago · updated 14 hours ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Ying Sheng (Author of SGLang).

fastllm by ztxz16

0.4% · 4k stars
High-performance C++ LLM inference library
created 2 years ago · updated 2 weeks ago
Starred by Nat Friedman (Former CEO of GitHub), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 6 more.

FasterTransformer by NVIDIA

0.2% · 6k stars
Optimized transformer library for inference
created 4 years ago · updated 1 year ago