GPU-accelerated LLM inference via quantization
This project provides an efficient GPU implementation for LLM inference using six-bit (FP6) quantization, targeting researchers and engineers seeking to reduce model size and inference costs while preserving accuracy. It offers significant speedups and memory reduction compared to FP16 and INT8 baselines.
How It Works
The core innovation is the TC-FPx kernel design, which provides unified Tensor Core support for various low-bit floating-point weight formats. At runtime, SIMT cores dequantize the x-bit weights to FP16 before the Tensor Cores perform the matrix multiplication. Ahead-of-time bit-level pre-packing optimizes memory access for the irregular bit-widths, and a SIMT-efficient GPU runtime minimizes dequantization overhead.
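To make the bit-level pre-packing idea concrete, here is a minimal NumPy sketch of the storage scheme: 6-bit weight codes are concatenated into a dense byte stream ahead of time, so four codes occupy exactly three bytes and the GPU only reads aligned bytes at runtime. This is a simplified illustration of the packing concept, not the project's actual packing routine, and the FP6 encoding/dequantization itself is not shown.

import numpy as np

def pack_6bit(codes: np.ndarray) -> np.ndarray:
    """Pack a 1-D array of 6-bit codes (values 0..63) into a dense byte stream."""
    assert codes.ndim == 1 and codes.size % 4 == 0, "pad so the code count is a multiple of 4"
    # Expand each code to 8 bits (MSB first) and keep only its 6 low bits.
    bits = np.unpackbits(codes.astype(np.uint8)[:, None], axis=1)[:, 2:]
    # Concatenate the 6-bit fields: every 4 codes land in exactly 3 bytes.
    return np.packbits(bits.reshape(-1))

def unpack_6bit(packed: np.ndarray, n_codes: int) -> np.ndarray:
    """Recover the original 6-bit codes from the packed byte stream."""
    bits = np.unpackbits(packed)[: n_codes * 6].reshape(n_codes, 6)
    # Re-pad each 6-bit field to 8 bits before converting it back to an integer.
    return np.packbits(np.pad(bits, ((0, 0), (2, 0))), axis=1).ravel()

rng = np.random.default_rng(0)
codes = rng.integers(0, 64, size=16, dtype=np.uint8)
packed = pack_6bit(codes)
assert np.array_equal(unpack_6bit(packed, codes.size), codes)
print(codes.size, "codes ->", packed.size, "bytes")  # 16 codes -> 12 bytes, vs. 32 bytes in FP16

Because packing is done once, offline, the irregular 6-bit boundaries never have to be handled on the inference-critical path.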
Quick Start & Requirements
pip install .
make
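After installation, usage follows the ahead-of-time packing plus fused-kernel pattern described above. The sketch below is hypothetical: fp6_llm.prepack_fp6 and fp6_llm.fp6_linear are placeholder names, not the package's verified API, so consult the repository for the actual Python entry points.

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
weight = torch.randn(4096, 4096, device=device)  # full-precision weight to quantize offline
x = torch.randn(8, 4096, device=device)          # a small batch of activations

# Offline: quantize the weight to FP6 and bit-level pre-pack it.
#   packed, scales = fp6_llm.prepack_fp6(weight)   # hypothetical name
# Runtime: the fused kernel dequantizes FP6 -> FP16 on SIMT cores and runs
# the matrix multiplication on Tensor Cores.
#   y = fp6_llm.fp6_linear(x, packed, scales)      # hypothetical name
y_ref = x @ weight.t()  # full-precision result the FP6 path should closely approximate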
Highlighted Details
Benchmarks report speedups over bitsandbytes and 2.6x over FP16 baselines on linear layers.
Maintenance & Community
Last update: 2 weeks ago; project activity is currently marked as inactive.
Licensing & Compatibility
Limitations & Caveats
Currently, FP6-LLM is primarily tested on A100 GPUs, and while other Tensor Core GPUs are expected to be compatible, further verification may be needed. The README mentions future support for FP4 and INT5, but these are not yet implemented.