Atom by efeslab

Low-bit quantization research paper for efficient LLM serving

Created 1 year ago
320 stars

Top 84.6% on SourcePulse

View on GitHub
Project Summary

Atom is a low-bit quantization algorithm designed to improve the efficiency and accuracy of Large Language Model (LLM) serving. It targets ML engineers and researchers working on LLM deployment, offering significant throughput gains with minimal accuracy loss by combining mixed-precision quantization, fine-grained group quantization, dynamic activation quantization, KV-cache quantization, and custom CUDA kernels.

How It Works

Atom employs a multi-faceted approach to low-bit quantization. It combines mixed-precision quantization, fine-grained group quantization, and dynamic activation quantization to maintain accuracy. Additionally, it incorporates KV-cache quantization and co-designs efficient CUDA kernels, building upon prior work like FlashInfer and FlashAttention, to maximize hardware utilization and throughput. This holistic strategy aims to overcome the sub-optimal performance of existing quantization schemes on modern GPUs.
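
To make the combination concrete, the sketch below illustrates the general idea in PyTorch: a small set of high-magnitude "outlier" input channels is kept at higher precision (INT8), the remaining weights are quantized to INT4 in fine-grained groups, and activations are quantized at runtime from their own statistics. This is a simplified illustration under assumed group sizes and outlier counts, not the project's actual CUDA implementation; all function names and parameters are hypothetical.

```python
import torch

def quantize_dequantize(x: torch.Tensor, n_bits: int, group_size: int) -> torch.Tensor:
    """Asymmetric fake quantization: one scale/zero-point per group of `group_size` values."""
    qmax = 2 ** n_bits - 1
    g = x.reshape(-1, group_size)                      # assumes numel is divisible by group_size
    lo = g.min(dim=-1, keepdim=True).values
    hi = g.max(dim=-1, keepdim=True).values
    scale = (hi - lo).clamp(min=1e-8) / qmax
    zero = torch.round(-lo / scale)
    q = torch.clamp(torch.round(g / scale) + zero, 0, qmax)
    return ((q - zero) * scale).reshape(x.shape)

def mixed_precision_weight_quant(w: torch.Tensor, n_outliers: int = 128,
                                 group_size: int = 128) -> torch.Tensor:
    """Keep the largest-magnitude input channels in INT8; quantize the rest to INT4 by group."""
    outliers = torch.zeros(w.shape[1], dtype=torch.bool)
    outliers[torch.topk(w.abs().mean(dim=0), n_outliers).indices] = True
    w_q = w.clone()
    w_q[:, ~outliers] = quantize_dequantize(w[:, ~outliers], n_bits=4, group_size=group_size)
    w_q[:, outliers] = quantize_dequantize(w[:, outliers], n_bits=8, group_size=n_outliers)
    return w_q

def dynamic_activation_quant(x: torch.Tensor, group_size: int = 128) -> torch.Tensor:
    """Quantize activations on the fly using per-group min/max (no static calibration)."""
    return quantize_dequantize(x, n_bits=4, group_size=group_size)

# Example: a 4096x4096 projection applied to 16 input tokens.
w = torch.randn(4096, 4096)
x = torch.randn(16, 4096)
y = dynamic_activation_quant(x) @ mixed_precision_weight_quant(w).T
```

In the real system these steps are fused into low-bit CUDA kernels rather than performed as separate quantize-dequantize passes; the sketch only shows the numerical scheme.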

Quick Start & Requirements

  • Installation: Clone the repository and use Docker with nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04. Set up a Conda environment (python=3.10) and install requirements.
  • Prerequisites: CUDA 11.3, cuDNN 8, GCC 11, CMake >= 3.24. Requires downloading LLM models from Hugging Face.
  • Setup: Requires compiling custom CUDA kernels.
  • Links: Paper, Slides, Poster

Highlighted Details

  • Achieves up to 7.73x higher end-to-end throughput compared to FP16 and 2.53x compared to INT8 quantization.
  • Supports simulated quantization for accuracy evaluation, using lm_eval to measure perplexity and zero-shot accuracy (see the sketch after this list).
  • Integrates code from SmoothQuant, GPTQ, and SparseGPT for result reproduction.
  • Evaluates end-to-end throughput and latency using the Punica serving framework.
  • Includes FP4 accuracy evaluation and support for Mixtral models.
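
Simulated (fake) quantization means weights are quantized and immediately dequantized, so the model still executes in floating point and can be passed to a harness such as lm_eval unchanged. The sketch below shows one plausible way to apply this to a model's linear layers; the helper name and defaults are hypothetical and not the project's actual evaluation code.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fake_quantize_linears(model: nn.Module, n_bits: int = 4, group_size: int = 128) -> None:
    """Replace every nn.Linear weight with its quantize-dequantize counterpart (in place)."""
    qmax = 2 ** n_bits - 1
    for module in model.modules():
        if isinstance(module, nn.Linear):
            w = module.weight.data
            g = w.reshape(-1, group_size)   # assumes hidden sizes are divisible by group_size
            lo = g.min(dim=-1, keepdim=True).values
            hi = g.max(dim=-1, keepdim=True).values
            scale = (hi - lo).clamp(min=1e-8) / qmax
            zero = torch.round(-lo / scale)
            q = torch.clamp(torch.round(g / scale) + zero, 0, qmax)
            module.weight.data = ((q - zero) * scale).reshape(w.shape)

# Usage sketch: load a Hugging Face model, call fake_quantize_linears(model), then run
# perplexity / zero-shot evaluation with lm_eval exactly as for the unmodified model.
```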

Maintenance & Community

The project is maintained under the efeslab organization and was presented at MLSys'24. The README does not provide links to community channels or a roadmap.

Licensing & Compatibility

The repository does not explicitly state a license. The inclusion of code segments from other projects suggests potential licensing considerations for commercial or closed-source use.

Limitations & Caveats

The current CUDA kernels are optimized specifically for the NVIDIA RTX 4090; optimization for other GPUs is planned. The full inference workflow for a real production scenario is still under development.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 1 star in the last 30 days

Explore Similar Projects

Starred by Jeremy Howard (Cofounder of fast.ai), Sasha Rush (Research Scientist at Cursor; Professor at Cornell Tech), and 1 more.

GPTQ-triton by fpgaminer — 307 stars (0%)
Triton kernel for GPTQ inference, improving context scaling
Created 2 years ago · Updated 2 years ago
Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI) and Jeremy Howard (Cofounder of fast.ai).

QuaRot by spcl — 424 stars (0.5%)
Code for a NeurIPS 2024 research paper on LLM quantization
Created 1 year ago · Updated 9 months ago
Starred by Yaowei Zheng (Author of LLaMA-Factory), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 6 more.

gptq by IST-DASLab — 2k stars (0.1%)
Code for GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers
Created 2 years ago · Updated 1 year ago
Starred by Yaowei Zheng (Author of LLaMA-Factory), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 7 more.

llm-awq by mit-han-lab — 3k stars (0.3%)
Weight quantization research paper for LLM compression/acceleration
Created 2 years ago · Updated 2 months ago