AutoAWQ by casper-hansen

AutoAWQ is a tool for 4-bit quantized LLM inference

created 1 year ago
2,222 stars

Top 20.8% on sourcepulse

View on GitHub
Project Summary

AutoAWQ provides an easy-to-use package for 4-bit quantization of Large Language Models (LLMs), significantly reducing memory requirements and accelerating inference. It implements the Activation-aware Weight Quantization (AWQ) algorithm, offering substantial speedups and memory savings compared to FP16 models, making LLMs more accessible for researchers and developers.

How It Works

AutoAWQ implements the AWQ algorithm, which quantizes LLM weights to INT4 while preserving accuracy by using activation statistics to protect the most salient weight channels. It offers two kernel modes: GEMV (fastest at batch size 1, less suitable for large contexts) and GEMM (faster than FP16 at batch sizes below 8, better for large contexts). The project also provides fused modules, which combine multiple layers into single operations for higher efficiency and typically rely on FasterTransformer kernels, which are Linux-only.
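A minimal quantization sketch following the API documented in the AutoAWQ README; the model id and output path are placeholders, and the "version" key selects between the GEMM and GEMV modes described above:

    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    model_path = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder: any supported architecture
    quant_path = "mistral-7b-instruct-awq"             # placeholder: output directory

    # "version" picks the kernel mode:
    #   "GEMM" -> faster than FP16 at batch sizes below 8, better for large contexts
    #   "GEMV" -> fastest at batch size 1, less suited to large contexts
    quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

    model = AutoAWQForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

    # Calibrate, quantize the weights to INT4, and save the AWQ checkpoint.
    model.quantize(tokenizer, quant_config=quant_config)
    model.save_quantized(quant_path)
    tokenizer.save_pretrained(quant_path)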

Quick Start & Requirements

  • Install: pip install autoawq, or pip install autoawq[kernels] for pre-built kernels (a loading and inference sketch follows this list).
  • Prerequisites: NVIDIA GPUs (Compute Capability 7.5+), CUDA 11.8+. AMD ROCm support via ExLlamaV2 kernels. Intel CPU/GPU requires PyTorch 2.4+.
  • Setup: Quantization time varies (10-15 mins for 7B, ~1 hour for 70B).
  • Docs: https://github.com/casper-hansen/AutoAWQ
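
A hedged loading/inference sketch for a checkpoint produced by the quantization step above; fuse_layers=True enables the fused modules, and the max_seq_len keyword (an assumption about the installed AutoAWQ version) pins the context length at load time:

    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    quant_path = "mistral-7b-instruct-awq"  # placeholder: directory saved by save_quantized

    # Fused modules are enabled here; the maximum sequence length is fixed at
    # load time and cannot be changed afterwards.
    model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True, max_seq_len=4096)
    tokenizer = AutoTokenizer.from_pretrained(quant_path)

    tokens = tokenizer("AWQ in one sentence:", return_tensors="pt").input_ids.cuda()
    output = model.generate(tokens, max_new_tokens=64)
    print(tokenizer.decode(output[0], skip_special_tokens=True))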

Highlighted Details

  • Achieves 2x-3x speedup and 3x memory reduction over FP16.
  • Supports various architectures including Mistral, Llama, Falcon, Mixtral, and more.
  • Offers CPU inference support (x86) and AMD ROCm support.
  • Provides export to GGUF and integration with Hugging Face transformers (see the sketch after this list).
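
A sketch of the Hugging Face transformers integration mentioned above, assuming autoawq is installed; the repo id is a placeholder for any AWQ-quantized checkpoint:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # transformers detects the AWQ quantization config in the checkpoint and
    # loads the 4-bit weights directly (requires autoawq to be installed).
    model_id = "TheBloke/Mistral-7B-Instruct-v0.2-AWQ"  # placeholder AWQ repo id
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    inputs = tokenizer("Hello from AWQ:", return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(output[0], skip_special_tokens=True))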

Maintenance & Community

The project has been deprecated and archived; its functionality has been adopted by vLLM's LLM Compressor and MLX-LM. Notable community support includes Intel's contributions for CPU optimizations.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking would require license clarification.

Limitations & Caveats

The project is deprecated. Fused modules require fuse_layers=True, which restricts sequence length changes after model creation and is incompatible with Windows. FasterTransformer kernels are Linux-only.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 94 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jaret Burkett (Founder of Ostris), and 1 more.

nunchaku by nunchaku-tech

2.1%
3k
High-performance 4-bit diffusion model inference engine
created 8 months ago
updated 15 hours ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Ying Sheng (Author of SGLang).

fastllm by ztxz16

0.4%
4k
High-performance C++ LLM inference library
created 2 years ago
updated 2 weeks ago
Starred by Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), Michael Han (Cofounder of Unsloth), and 1 more.

ktransformers by kvcache-ai

0.4%
15k
Framework for LLM inference optimization experimentation
created 1 year ago
updated 2 days ago