AutoAWQ by casper-hansen

AutoAWQ is a tool for 4-bit quantized LLM inference

Created 2 years ago
2,249 stars

Top 20.2% on SourcePulse

Project Summary

AutoAWQ provides an easy-to-use package for 4-bit quantization of Large Language Models (LLMs), significantly reducing memory requirements and accelerating inference. It implements the Activation-aware Weight Quantization (AWQ) algorithm, offering substantial speedups and memory savings compared to FP16 models, making LLMs more accessible for researchers and developers.

How It Works

AutoAWQ implements the AWQ algorithm, which quantizes LLM weights to INT4 while preserving performance by considering activation properties. It offers two quantization modes: GEMV (faster for batch size 1, less suitable for large contexts) and GEMM (faster than FP16 for batch sizes below 8, better for large contexts). The project also leverages fused modules, combining multiple layers into single operations for increased efficiency, often utilizing FasterTransformer kernels for Linux environments.
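
A minimal sketch of how the mode is selected at quantization time, based on the configuration dictionary used in the project's examples (exact defaults may vary between versions):

    # AWQ quantization settings; the "version" key picks the kernel.
    quant_config = {
        "zero_point": True,   # asymmetric quantization with a zero point
        "q_group_size": 128,  # quantize weights in groups of 128
        "w_bit": 4,           # 4-bit weights
        "version": "GEMM",    # "GEMM": large contexts, batch sizes below 8
                              # "GEMV": batch size 1, shorter contexts
    }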

Quick Start & Requirements

  • Install: pip install autoawq or pip install autoawq[kernels] for pre-built kernels.
  • Prerequisites: NVIDIA GPUs (Compute Capability 7.5+), CUDA 11.8+. AMD ROCm support via ExLlamaV2 kernels. Intel CPU/GPU requires PyTorch 2.4+.
  • Setup: Quantization takes roughly 10-15 minutes for a 7B model and about an hour for a 70B model; see the quantization sketch after this list.
  • Docs: https://github.com/casper-hansen/AutoAWQ
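
A minimal end-to-end quantization sketch, following the usage shown in the project README (the model id and output path below are placeholders):

    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    model_path = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder model id
    quant_path = "mistral-7b-instruct-awq"             # placeholder output dir
    quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

    # Load the FP16 model and its tokenizer
    model = AutoAWQForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

    # Run AWQ calibration and quantize the weights to INT4
    model.quantize(tokenizer, quant_config=quant_config)

    # Save the quantized model alongside its tokenizer
    model.save_quantized(quant_path)
    tokenizer.save_pretrained(quant_path)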

Highlighted Details

  • Achieves 2x-3x speedup and 3x memory reduction over FP16.
  • Supports various architectures including Mistral, Llama, Falcon, Mixtral, and more.
  • Offers x86 CPU inference and AMD ROCm support.
  • Provides export to GGUF and integration with Hugging Face transformers (see the loading sketch below).
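
As an illustration of the transformers integration, a quantized checkpoint can typically be loaded directly with AutoModelForCausalLM, which picks up the AWQ settings stored in the model's config (the repository id below is hypothetical):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    quant_id = "your-org/mistral-7b-instruct-awq"  # hypothetical Hub repo id
    tokenizer = AutoTokenizer.from_pretrained(quant_id)
    model = AutoModelForCausalLM.from_pretrained(quant_id, device_map="auto")

    inputs = tokenizer("Explain AWQ in one sentence.", return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))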

Maintenance & Community

The project has been deprecated and archived, as its functionality has been adopted by vLLM Compressor and MLX-LM. Notable community contributions include CPU optimizations from Intel.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking would require license clarification.

Limitations & Caveats

The project is deprecated. Fused modules require fuse_layers=True, which restricts sequence length changes after model creation and is incompatible with Windows. FasterTransformer kernels are Linux-only.
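
A sketch of loading with fused modules, assuming a locally saved quantized checkpoint; the sequence-length argument name has changed between releases (older versions used max_new_tokens), so treat the parameter below as illustrative:

    from awq import AutoAWQForCausalLM

    model = AutoAWQForCausalLM.from_quantized(
        "mistral-7b-instruct-awq",  # hypothetical local path
        fuse_layers=True,           # enable fused modules (Linux only)
        max_seq_len=4096,           # fixed at load time; cannot be changed later
    )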

Health Check

  • Last Commit: 4 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 25 stars in the last 30 days

Explore Similar Projects

Starred by Yaowei Zheng (Author of LLaMA-Factory), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 7 more.

llm-awq by mit-han-lab

0.3%
3k
Weight quantization research paper for LLM compression/acceleration
Created 2 years ago
Updated 2 months ago
Starred by Wing Lian (Founder of Axolotl AI) and Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems").

airllm by lyogavin

0.1%
6k
Inference optimization for LLMs on low-resource hardware
Created 2 years ago
Updated 2 weeks ago
Starred by Tobi Lutke (Cofounder of Shopify), Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; Author of CS 231n), and 36 more.

unsloth by unslothai

0.6%
46k
Finetuning tool for LLMs, targeting speed and memory efficiency
Created 1 year ago
Updated 12 hours ago