Atom by efeslab

Low-bit quantization research paper for efficient LLM serving

Created 1 year ago
320 stars

Top 84.6% on SourcePulse

View on GitHub
Project Summary

Atom is a low-bit quantization algorithm designed to improve the efficiency and accuracy of Large Language Model (LLM) serving. It targets ML engineers and researchers working on LLM deployment, offering significant throughput gains with minimal accuracy loss by combining mixed-precision quantization, fine-grained group quantization, dynamic activation quantization, KV-cache quantization, and custom CUDA kernels.

How It Works

Atom employs a multi-faceted approach to low-bit quantization. It combines mixed-precision quantization, fine-grained group quantization, and dynamic activation quantization to maintain accuracy. Additionally, it incorporates KV-cache quantization and co-designs efficient CUDA kernels, building upon prior work like FlashInfer and FlashAttention, to maximize hardware utilization and throughput. This holistic strategy aims to overcome the sub-optimal performance of existing quantization schemes on modern GPUs.
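
To make the combination concrete, the sketch below illustrates the general idea in PyTorch: a small set of high-magnitude "outlier" input channels is kept at higher precision (INT8), the remaining weights are quantized to INT4 in fine-grained groups, and activations are quantized at runtime from their own statistics. This is a simplified illustration under assumed group sizes and outlier counts, not the project's actual CUDA implementation; all function names and parameters are hypothetical.

```python
import torch

def quantize_dequantize(x: torch.Tensor, n_bits: int, group_size: int) -> torch.Tensor:
    """Asymmetric fake quantization: one scale/zero-point per group of `group_size` values."""
    qmax = 2 ** n_bits - 1
    g = x.reshape(-1, group_size)                      # assumes numel is divisible by group_size
    lo = g.min(dim=-1, keepdim=True).values
    hi = g.max(dim=-1, keepdim=True).values
    scale = (hi - lo).clamp(min=1e-8) / qmax
    zero = torch.round(-lo / scale)
    q = torch.clamp(torch.round(g / scale) + zero, 0, qmax)
    return ((q - zero) * scale).reshape(x.shape)

def mixed_precision_weight_quant(w: torch.Tensor, n_outliers: int = 128,
                                 group_size: int = 128) -> torch.Tensor:
    """Keep the largest-magnitude input channels in INT8; quantize the rest to INT4 by group."""
    outliers = torch.zeros(w.shape[1], dtype=torch.bool)
    outliers[torch.topk(w.abs().mean(dim=0), n_outliers).indices] = True
    w_q = w.clone()
    w_q[:, ~outliers] = quantize_dequantize(w[:, ~outliers], n_bits=4, group_size=group_size)
    w_q[:, outliers] = quantize_dequantize(w[:, outliers], n_bits=8, group_size=n_outliers)
    return w_q

def dynamic_activation_quant(x: torch.Tensor, group_size: int = 128) -> torch.Tensor:
    """Quantize activations on the fly using per-group min/max (no static calibration)."""
    return quantize_dequantize(x, n_bits=4, group_size=group_size)

# Example: a 4096x4096 projection applied to 16 input tokens.
w = torch.randn(4096, 4096)
x = torch.randn(16, 4096)
y = dynamic_activation_quant(x) @ mixed_precision_weight_quant(w).T
```

In the real system these steps are fused into low-bit CUDA kernels rather than performed as separate quantize-dequantize passes; the sketch only shows the numerical scheme.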

Quick Start & Requirements

  • Installation: Clone the repository and use Docker with nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04. Set up a Conda environment (python=3.10) and install requirements.
  • Prerequisites: CUDA 11.3, cuDNN 8, GCC 11, CMake >= 3.24. Requires downloading LLM models from Hugging Face.
  • Setup: Requires compiling custom CUDA kernels.
  • Links: Paper, Slides, Poster

Highlighted Details

  • Achieves up to 7.73x higher end-to-end throughput compared to FP16 and 2.53x compared to INT8 quantization.
  • Supports simulated quantization for accuracy evaluation, using lm_eval to measure perplexity and zero-shot accuracy (see the sketch after this list).
  • Integrates code from SmoothQuant, GPTQ, and SparseGPT for result reproduction.
  • Evaluates end-to-end throughput and latency using the Punica serving framework.
  • Includes FP4 accuracy evaluation and support for Mixtral models.
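
Simulated (fake) quantization means weights are quantized and immediately dequantized, so the model still executes in floating point and can be passed to a harness such as lm_eval unchanged. The sketch below shows one plausible way to apply this to a model's linear layers; the helper name and defaults are hypothetical and not the project's actual evaluation code.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fake_quantize_linears(model: nn.Module, n_bits: int = 4, group_size: int = 128) -> None:
    """Replace every nn.Linear weight with its quantize-dequantize counterpart (in place)."""
    qmax = 2 ** n_bits - 1
    for module in model.modules():
        if isinstance(module, nn.Linear):
            w = module.weight.data
            g = w.reshape(-1, group_size)   # assumes hidden sizes are divisible by group_size
            lo = g.min(dim=-1, keepdim=True).values
            hi = g.max(dim=-1, keepdim=True).values
            scale = (hi - lo).clamp(min=1e-8) / qmax
            zero = torch.round(-lo / scale)
            q = torch.clamp(torch.round(g / scale) + zero, 0, qmax)
            module.weight.data = ((q - zero) * scale).reshape(w.shape)

# Usage sketch: load a Hugging Face model, call fake_quantize_linears(model), then run
# perplexity / zero-shot evaluation with lm_eval exactly as for the unmodified model.
```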

Maintenance & Community

The project is maintained under the efeslab organization and was presented at MLSys'24. The README does not provide links to community channels or a roadmap.

Licensing & Compatibility

The repository does not explicitly state a license. The inclusion of code segments from other projects suggests potential licensing considerations for commercial or closed-source use.

Limitations & Caveats

The current CUDA kernels are optimized specifically for the NVIDIA RTX 4090; optimization for other GPUs is planned. The full inference workflow for a real production scenario is still under development.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 1 star in the last 30 days

Explore Similar Projects

Starred by Jeremy Howard (Cofounder of fast.ai), Sasha Rush (Research Scientist at Cursor; Professor at Cornell Tech), and 1 more.

GPTQ-triton by fpgaminer — 307 stars (0%)
Triton kernel for GPTQ inference, improving context scaling
Created 2 years ago · Updated 2 years ago
Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI) and Jeremy Howard (Cofounder of fast.ai).

QuaRot by spcl — 424 stars (0.5%)
Code for a NeurIPS 2024 research paper on LLM quantization
Created 1 year ago · Updated 9 months ago
Starred by Yaowei Zheng (Author of LLaMA-Factory), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 6 more.

gptq by IST-DASLab — 2k stars (0.1%)
Code for GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers
Created 2 years ago · Updated 1 year ago
Starred by Yaowei Zheng (Author of LLaMA-Factory), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 7 more.

llm-awq by mit-han-lab — 3k stars (0.3%)
Weight quantization research paper for LLM compression/acceleration
Created 2 years ago · Updated 2 months ago