Atom by efeslab

Low-bit quantization research paper for efficient LLM serving

Created 1 year ago · 318 stars · Top 86.3% on sourcepulse

Project Summary

Atom is a low-bit quantization algorithm designed to improve the efficiency and accuracy of Large Language Model (LLM) serving. It targets ML engineers and researchers working on LLM deployment, offering significant throughput gains with minimal accuracy loss by leveraging mixed-precision, group quantization, dynamic activation quantization, KV-cache quantization, and custom CUDA kernels.

How It Works

Atom employs a multi-faceted approach to low-bit quantization. It combines mixed-precision quantization, fine-grained group quantization, and dynamic activation quantization to maintain accuracy. Additionally, it incorporates KV-cache quantization and co-designs efficient CUDA kernels, building upon prior work like FlashInfer and FlashAttention, to maximize hardware utilization and throughput. This holistic strategy aims to overcome the sub-optimal performance of existing quantization schemes on modern GPUs.
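
To make the core idea concrete, below is a minimal PyTorch sketch of the numerics behind mixed-precision group quantization: a few high-magnitude outlier channels are kept in full precision while the remaining activations are dynamically quantized to asymmetric INT4 in fine-grained groups. This is an illustration only; the function names, group size, and outlier count are assumptions, and Atom itself quantizes outliers to INT8 and performs these steps in fused CUDA kernels.

```python
# Minimal simulation of mixed-precision group quantization (illustrative;
# not Atom's actual kernels, which quantize outliers to INT8 and fuse
# these steps on the GPU).
import torch

def quantize_group_int4(x: torch.Tensor, group_size: int = 128) -> torch.Tensor:
    """Dynamic asymmetric INT4 quantization, one (scale, zero-point) per group."""
    xg = x.reshape(-1, group_size)
    xmin = xg.min(dim=1, keepdim=True).values
    xmax = xg.max(dim=1, keepdim=True).values
    scale = (xmax - xmin).clamp(min=1e-8) / 15.0   # 4-bit unsigned range 0..15
    zero = (-xmin / scale).round()
    q = (xg / scale + zero).round().clamp(0, 15)   # quantize
    return ((q - zero) * scale).reshape_as(x)      # dequantize (simulation only)

def mixed_precision_quantize(act: torch.Tensor, n_outlier: int = 128,
                             group_size: int = 128) -> torch.Tensor:
    """Keep the highest-magnitude channels in full precision (a stand-in for
    Atom's INT8 outlier path); group-quantize the rest to INT4."""
    outlier_idx = act.abs().amax(dim=0).topk(n_outlier).indices
    normal = torch.ones(act.shape[1], dtype=torch.bool)
    normal[outlier_idx] = False
    out = act.clone()
    out[:, normal] = quantize_group_int4(act[:, normal], group_size)
    return out

acts = torch.randn(8, 512) * (1 + 4 * torch.rand(512))  # channels with outliers
err = (mixed_precision_quantize(acts) - acts).abs().mean()
print(f"mean simulated quantization error: {err.item():.4f}")
```

"Dynamic" here means the scales are computed from the live activation values per group, which is what lets low-bit activations track their actual range at inference time.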

Quick Start & Requirements

  • Installation: Clone the repository and use Docker with nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04. Set up a Conda environment (python=3.10) and install requirements.
  • Prerequisites: CUDA 11.3, cuDNN 8, GCC 11, CMake >= 3.24. Requires downloading LLM models from Hugging Face (see the sketch after this list).
  • Setup: Requires compiling custom CUDA kernels.
  • Links: Paper, Slides, Poster
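
Since the setup expects model weights to be available locally, here is a hedged sketch of fetching a checkpoint with huggingface_hub; the model ID and target directory are placeholders rather than Atom's defaults, and gated models such as Llama additionally require an access token.

```python
# Placeholder sketch for the "download LLM models from Hugging Face" step;
# the repo id and local directory are illustrative, not Atom's defaults.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="meta-llama/Llama-2-7b-hf",  # placeholder; gated repos need a token
    local_dir="models/llama-2-7b-hf",    # placeholder path
)
```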

Highlighted Details

  • Achieves up to 7.73x higher end-to-end throughput compared to FP16 and 2.53x compared to INT8 quantization.
  • Supports simulated quantization for accuracy evaluation using lm_eval for perplexity and zero-shot accuracy (see the sketch after this list).
  • Integrates code from SmoothQuant, GPTQ, and SparseGPT for result reproduction.
  • Evaluates end-to-end throughput and latency using the Punica serving framework.
  • Includes FP4 accuracy evaluation and support for Mixtral models.
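
As a rough illustration of the accuracy-evaluation workflow, the sketch below runs perplexity and zero-shot tasks through lm-evaluation-harness. The model name and task list are placeholders, and the entry point shown is the v0.4-style lm_eval.simple_evaluate API; Atom's own evaluation scripts may invoke the harness differently.

```python
# Hedged example: evaluating a (quantized) checkpoint with
# lm-evaluation-harness. Model and tasks are placeholders, not Atom's setup.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                        # Hugging Face backend
    model_args="pretrained=meta-llama/Llama-2-7b-hf",  # placeholder model
    tasks=["wikitext", "piqa", "arc_easy"],            # perplexity + zero-shot
)
print(results["results"])
```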

Maintenance & Community

The project is maintained under the efeslab organization and was presented at MLSys'24. The README does not link to community channels or a roadmap.

Licensing & Compatibility

The repository does not explicitly state a license. The inclusion of code segments from other projects suggests potential licensing considerations for commercial or closed-source use.

Limitations & Caveats

The current CUDA kernels are optimized specifically for the RTX 4090; support for other GPUs is planned. The full inference workflow for a real production serving scenario is still under development.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 13 stars in the last 90 days

Explore Similar Projects

nunchaku by nunchaku-tech
High-performance 4-bit diffusion model inference engine
3k stars · created 8 months ago · updated 14 hours ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jaret Burkett (Founder of Ostris), and 1 more.

AQLM by Vahe1994
PyTorch code for LLM compression via Additive Quantization (AQLM)
1k stars · created 1 year ago · updated 2 months ago
Starred by Lysandre Debut (Chief Open-Source Officer at Hugging Face), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 1 more.

llm-awq by mit-han-lab
Weight quantization research paper for LLM compression/acceleration
3k stars · created 2 years ago · updated 2 weeks ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jeremy Howard (Cofounder of fast.ai), and 4 more.

GPTQ-for-LLaMa by qwopqwop200
4-bit quantization for LLaMA models using GPTQ
3k stars · created 2 years ago · updated 1 year ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Georgios Konstantopoulos (CTO, General Partner at Paradigm), and 2 more.

AutoGPTQ by AutoGPTQ
LLM quantization package using GPTQ algorithm
5k stars · created 2 years ago · updated 3 months ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 6 more.