AutoAWQ by casper-hansen

AutoAWQ is a tool for 4-bit quantized LLM inference

created 1 year ago
2,222 stars

Top 20.8% on sourcepulse

View on GitHub
Project Summary

AutoAWQ provides an easy-to-use package for 4-bit quantization of Large Language Models (LLMs), significantly reducing memory requirements and accelerating inference. It implements the Activation-aware Weight Quantization (AWQ) algorithm, offering substantial speedups and memory savings compared to FP16 models, making LLMs more accessible for researchers and developers.

How It Works

AutoAWQ implements the AWQ algorithm, which quantizes LLM weights to INT4 while preserving accuracy by using activation statistics to protect the most salient weight channels. It offers two kernel modes: GEMV (fastest at batch size 1, less suitable for large contexts) and GEMM (faster than FP16 at batch sizes below 8, better for large contexts). The project also provides fused modules, which combine multiple layers into single operations for higher efficiency and typically rely on FasterTransformer kernels, which are Linux-only.
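A minimal quantization sketch following the API documented in the AutoAWQ README; the model id and output path are placeholders, and the "version" key selects between the GEMM and GEMV modes described above:

    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    model_path = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder: any supported architecture
    quant_path = "mistral-7b-instruct-awq"             # placeholder: output directory

    # "version" picks the kernel mode:
    #   "GEMM" -> faster than FP16 at batch sizes below 8, better for large contexts
    #   "GEMV" -> fastest at batch size 1, less suited to large contexts
    quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

    model = AutoAWQForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

    # Calibrate, quantize the weights to INT4, and save the AWQ checkpoint.
    model.quantize(tokenizer, quant_config=quant_config)
    model.save_quantized(quant_path)
    tokenizer.save_pretrained(quant_path)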

Quick Start & Requirements

  • Install: pip install autoawq, or pip install autoawq[kernels] for pre-built kernels (a loading and inference sketch follows this list).
  • Prerequisites: NVIDIA GPUs (Compute Capability 7.5+), CUDA 11.8+. AMD ROCm support via ExLlamaV2 kernels. Intel CPU/GPU requires PyTorch 2.4+.
  • Setup: Quantization time varies (10-15 mins for 7B, ~1 hour for 70B).
  • Docs: https://github.com/casper-hansen/AutoAWQ
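
A hedged loading/inference sketch for a checkpoint produced by the quantization step above; fuse_layers=True enables the fused modules, and the max_seq_len keyword (an assumption about the installed AutoAWQ version) pins the context length at load time:

    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    quant_path = "mistral-7b-instruct-awq"  # placeholder: directory saved by save_quantized

    # Fused modules are enabled here; the maximum sequence length is fixed at
    # load time and cannot be changed afterwards.
    model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True, max_seq_len=4096)
    tokenizer = AutoTokenizer.from_pretrained(quant_path)

    tokens = tokenizer("AWQ in one sentence:", return_tensors="pt").input_ids.cuda()
    output = model.generate(tokens, max_new_tokens=64)
    print(tokenizer.decode(output[0], skip_special_tokens=True))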

Highlighted Details

  • Achieves 2x-3x speedup and 3x memory reduction over FP16.
  • Supports various architectures including Mistral, Llama, Falcon, Mixtral, and more.
  • Offers CPU inference support (x86) and AMD ROCm support.
  • Provides export to GGUF and integration with Hugging Face transformers (see the sketch after this list).
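
A sketch of the Hugging Face transformers integration mentioned above, assuming autoawq is installed; the repo id is a placeholder for any AWQ-quantized checkpoint:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # transformers detects the AWQ quantization config in the checkpoint and
    # loads the 4-bit weights directly (requires autoawq to be installed).
    model_id = "TheBloke/Mistral-7B-Instruct-v0.2-AWQ"  # placeholder AWQ repo id
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    inputs = tokenizer("Hello from AWQ:", return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(output[0], skip_special_tokens=True))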

Maintenance & Community

The project has been deprecated and archived; its functionality has been adopted by vLLM's LLM Compressor and MLX-LM. Notable community support includes Intel's contributions for CPU optimizations.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking would require license clarification.

Limitations & Caveats

The project is deprecated. Fused modules require fuse_layers=True, which restricts sequence length changes after model creation and is incompatible with Windows. FasterTransformer kernels are Linux-only.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 94 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jaret Burkett (Founder of Ostris), and 1 more.

nunchaku by nunchaku-tech

2.1%
3k
High-performance 4-bit diffusion model inference engine
created 8 months ago
updated 15 hours ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Ying Sheng (Author of SGLang).

fastllm by ztxz16

0.4%
4k
High-performance C++ LLM inference library
created 2 years ago
updated 2 weeks ago
Starred by Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), Michael Han (Cofounder of Unsloth), and 1 more.

ktransformers by kvcache-ai

0.4%
15k
Framework for LLM inference optimization experimentation
created 1 year ago
updated 2 days ago