llm-awq by mit-han-lab

Weight quantization research paper for LLM compression/acceleration

created 2 years ago
3,185 stars

Top 15.5% on sourcepulse

Project Summary

AWQ (Activation-aware Weight Quantization) is a technique for compressing and accelerating Large Language Models (LLMs) by quantizing weights to low bit precision (INT3/4) while maintaining accuracy. It targets researchers and developers working with LLMs who need to reduce memory footprint and improve inference speed, especially on resource-constrained devices.
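As a rough sense of the footprint reduction, here is a back-of-the-envelope estimate for a hypothetical 7B-parameter model (it ignores activations, the KV cache, and quantization metadata such as per-group scales):

```python
# Back-of-the-envelope weight-memory estimate (hypothetical 7B model; ignores
# activations, KV cache, and the per-group scales/zeros quantization adds).
params = 7e9
fp16_gb = params * 2 / 1e9    # 2 bytes per weight  -> ~14 GB
int4_gb = params * 0.5 / 1e9  # 4 bits per weight   -> ~3.5 GB
print(f"FP16: ~{fp16_gb:.1f} GB, INT4: ~{int4_gb:.1f} GB")
```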

How It Works

AWQ employs an activation-aware approach, identifying and protecting salient weights that are crucial for model performance. It analyzes activation data to determine which weights are most sensitive to quantization errors, applying a more aggressive quantization strategy to less critical weights. This method achieves state-of-the-art accuracy for INT3/4 quantization, outperforming previous methods by preserving critical weights.
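The following is a minimal, illustrative PyTorch sketch of that idea: a per-input-channel scale derived from calibration activations, followed by ordinary group-wise INT4 quantization. The function name, the fixed alpha, and the scale handling are simplifications for exposition, not the repository's actual implementation (which grid-searches the scaling factor and fuses it into the preceding layer):

```python
import torch

def awq_style_quantize(w, act_mean_abs, alpha=0.5, n_bits=4, group_size=128):
    """Toy illustration of activation-aware weight quantization.

    w            : [out_features, in_features] FP16/FP32 weight matrix
    act_mean_abs : [in_features] mean |activation| per input channel,
                   collected from a small calibration set
    Assumes in_features is divisible by group_size.
    """
    # 1. Salient input channels (large activations) are scaled up before
    #    quantization, so they suffer less relative rounding error.
    s = act_mean_abs.clamp(min=1e-5) ** alpha            # [in_features]
    w_scaled = w * s                                     # broadcast over columns

    # 2. Plain group-wise symmetric quantization of the scaled weights.
    q_max = 2 ** (n_bits - 1) - 1                        # 7 for INT4
    out_f, in_f = w_scaled.shape
    wg = w_scaled.reshape(out_f, in_f // group_size, group_size)
    scale = (wg.abs().amax(dim=-1, keepdim=True) / q_max).clamp(min=1e-8)
    w_q = (wg / scale).round().clamp(-q_max - 1, q_max)

    # 3. Dequantize and undo the channel scaling (at runtime the 1/s factor
    #    is folded into the previous operator, so inference stays cheap).
    return (w_q * scale).reshape(out_f, in_f) / s
```

For example, awq_style_quantize(torch.randn(4096, 4096), torch.rand(4096)) returns a fake-quantized weight of the same shape that can be swapped into a layer for quick accuracy checks.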

Quick Start & Requirements

  • Install: Clone the repository and install via pip: pip install -e .
  • Prerequisites: Python 3.10+ and PyTorch (>=2.0.0 for edge devices, with NVIDIA's device-specific binaries on Jetson Orin). Building the CUDA kernels is required for optimized performance.
  • Model Zoo: Pre-computed AWQ search results are available via git clone https://huggingface.co/datasets/mit-han-lab/awq-model-zoo (a Python alternative is sketched after this list).
  • Docs: Paper, Website
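If you prefer not to shell out to git, the model zoo can also be fetched with huggingface_hub; a minimal sketch, assuming the huggingface_hub package is installed (the dataset's internal file layout is not shown here):

```python
# Alternative to the `git clone` above: download the pre-computed AWQ
# search results as a Hugging Face dataset snapshot.
from huggingface_hub import snapshot_download

zoo_path = snapshot_download(
    repo_id="mit-han-lab/awq-model-zoo",
    repo_type="dataset",  # published as a dataset repo, not a model repo
)
print("AWQ model zoo downloaded to:", zoo_path)
```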

Highlighted Details

  • Supports instruction-tuned and multi-modal LMs (e.g., Llama-3, VILA, LLaVA).
  • Offers memory-efficient 4-bit Linear layers in PyTorch with efficient CUDA kernels (see the packing sketch after this list).
  • Achieves state-of-the-art prefilling speed on edge devices via TinyChat.
  • Integrated into major platforms like Hugging Face Transformers, NVIDIA TensorRT-LLM, and Amazon SageMaker.
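To make the 4-bit Linear bullet concrete, here is a small, repository-agnostic sketch of how two INT4 weights can share one byte. The real kernels use their own packed layout and custom CUDA; the function names and layout below are assumptions for illustration only:

```python
import torch

def pack_int4(q):
    """Pack signed INT4 values (integers in [-8, 7]) into uint8, two per byte.
    Illustrative only; the repo's CUDA kernels use a different packed layout."""
    assert q.shape[-1] % 2 == 0
    u = (q + 8).to(torch.uint8)          # shift to the unsigned nibble range [0, 15]
    lo, hi = u[..., 0::2], u[..., 1::2]  # pair up adjacent columns
    return lo | (hi << 4)                # one byte now holds two 4-bit weights

def unpack_int4(packed):
    """Inverse of pack_int4: recover the signed INT4 values."""
    lo = (packed & 0xF).to(torch.int8) - 8
    hi = (packed >> 4).to(torch.int8) - 8
    out = torch.stack([lo, hi], dim=-1)
    return out.reshape(*packed.shape[:-1], packed.shape[-1] * 2)
```

Round-tripping an integer tensor with values in [-8, 7] through pack_int4/unpack_int4 recovers it exactly, while the packed form uses half the bytes of INT8 storage and a quarter of FP16.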

Maintenance & Community

The project is maintained by the MIT HAN Lab (mit-han-lab) at MIT and has seen significant adoption, including integration by Google Vertex AI and AMD. Updates are posted regularly; recent additions include Llama-3 and BF16 support.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Related projects and integrations suggest broad compatibility, but users should verify the license terms for their specific use case.

Limitations & Caveats

While AWQ supports a wide range of models, users may need to run the AWQ search themselves for models not included in the pre-computed model zoo. Installation on edge devices like NVIDIA Jetson Orin requires specific manual steps for PyTorch and CUDA kernel setup.

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 4
  • Issues (30d): 3

Star History

  • 216 stars in the last 90 days

Explore Similar Projects

  • nunchaku by nunchaku-tech — High-performance 4-bit diffusion model inference engine. 2.1%, 3k stars; created 8 months ago, updated 11 hours ago. Starred by Chip Huyen (author of AI Engineering, Designing Machine Learning Systems), Jaret Burkett (founder of Ostris), and 1 more.
  • GPTQ-for-LLaMa by qwopqwop200 — 4-bit quantization for LLaMA models using GPTQ. 0.0%, 3k stars; created 2 years ago, updated 1 year ago. Starred by Chip Huyen (author of AI Engineering, Designing Machine Learning Systems), Georgios Konstantopoulos (CTO, General Partner at Paradigm), and 2 more.
  • AutoGPTQ by AutoGPTQ — LLM quantization package using the GPTQ algorithm. 0.1%, 5k stars; created 2 years ago, updated 3 months ago. Starred by Chip Huyen (author of AI Engineering, Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 6 more.