GPU-accelerated LLM inference via quantization
This project provides an efficient GPU implementation for LLM inference using six-bit (FP6) quantization, targeting researchers and engineers seeking to reduce model size and inference costs while preserving accuracy. It offers significant speedups and memory reduction compared to FP16 and INT8 baselines.
How It Works
The core innovation is the TC-FPx kernel design, which provides unified Tensor Core support for various low-bit floating-point weight formats. At runtime, SIMT cores dequantize the x-bit weights to FP16 before the Tensor Cores perform the matrix multiplication. Ahead-of-time bit-level pre-packing optimizes memory access for the irregular bit-widths, and a SIMT-efficient GPU runtime minimizes dequantization overhead.
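To make the bit-level pre-packing idea concrete, here is a minimal NumPy sketch of the storage scheme: 6-bit weight codes are concatenated into a dense byte stream ahead of time, so four codes occupy exactly three bytes and the GPU only reads aligned bytes at runtime. This is a simplified illustration of the packing concept, not the project's actual packing routine, and the FP6 encoding/dequantization itself is not shown.

import numpy as np

def pack_6bit(codes: np.ndarray) -> np.ndarray:
    """Pack a 1-D array of 6-bit codes (values 0..63) into a dense byte stream."""
    assert codes.ndim == 1 and codes.size % 4 == 0, "pad so the code count is a multiple of 4"
    # Expand each code to 8 bits (MSB first) and keep only its 6 low bits.
    bits = np.unpackbits(codes.astype(np.uint8)[:, None], axis=1)[:, 2:]
    # Concatenate the 6-bit fields: every 4 codes land in exactly 3 bytes.
    return np.packbits(bits.reshape(-1))

def unpack_6bit(packed: np.ndarray, n_codes: int) -> np.ndarray:
    """Recover the original 6-bit codes from the packed byte stream."""
    bits = np.unpackbits(packed)[: n_codes * 6].reshape(n_codes, 6)
    # Re-pad each 6-bit field to 8 bits before converting it back to an integer.
    return np.packbits(np.pad(bits, ((0, 0), (2, 0))), axis=1).ravel()

rng = np.random.default_rng(0)
codes = rng.integers(0, 64, size=16, dtype=np.uint8)
packed = pack_6bit(codes)
assert np.array_equal(unpack_6bit(packed, codes.size), codes)
print(codes.size, "codes ->", packed.size, "bytes")  # 16 codes -> 12 bytes, vs. 32 bytes in FP16

Because packing is done once, offline, the irregular 6-bit boundaries never have to be handled on the inference-critical path.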
Quick Start & Requirements
pip install .
make
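After installation, usage follows the ahead-of-time packing plus fused-kernel pattern described above. The sketch below is hypothetical: fp6_llm.prepack_fp6 and fp6_llm.fp6_linear are placeholder names, not the package's verified API, so consult the repository for the actual Python entry points.

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
weight = torch.randn(4096, 4096, device=device)  # full-precision weight to quantize offline
x = torch.randn(8, 4096, device=device)          # a small batch of activations

# Offline: quantize the weight to FP6 and bit-level pre-pack it.
#   packed, scales = fp6_llm.prepack_fp6(weight)   # hypothetical name
# Runtime: the fused kernel dequantizes FP6 -> FP16 on SIMT cores and runs
# the matrix multiplication on Tensor Cores.
#   y = fp6_llm.fp6_linear(x, packed, scales)      # hypothetical name
y_ref = x @ weight.t()  # full-precision result the FP6 path should closely approximate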
Highlighted Details
Benchmarks report speedups over bitsandbytes and 2.6x over FP16 baselines on linear layers.
Maintenance & Community
Last update: 2 weeks ago; project activity is currently marked as inactive.
Licensing & Compatibility
Limitations & Caveats
Currently, FP6-LLM is primarily tested on A100 GPUs, and while other Tensor Core GPUs are expected to be compatible, further verification may be needed. The README mentions future support for FP4 and INT5, but these are not yet implemented.