llm-awq by mit-han-lab

Weight quantization research paper for LLM compression/acceleration

created 2 years ago
3,185 stars

Top 15.5% on sourcepulse

Project Summary

AWQ (Activation-aware Weight Quantization) is a technique for compressing and accelerating Large Language Models (LLMs) by quantizing weights to low bit precision (INT3/4) while maintaining accuracy. It targets researchers and developers working with LLMs who need to reduce memory footprint and improve inference speed, especially on resource-constrained devices.
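As a rough sense of the footprint reduction, here is a back-of-the-envelope estimate for a hypothetical 7B-parameter model (it ignores activations, the KV cache, and quantization metadata such as per-group scales):

```python
# Back-of-the-envelope weight-memory estimate (hypothetical 7B model; ignores
# activations, KV cache, and the per-group scales/zeros quantization adds).
params = 7e9
fp16_gb = params * 2 / 1e9    # 2 bytes per weight  -> ~14 GB
int4_gb = params * 0.5 / 1e9  # 4 bits per weight   -> ~3.5 GB
print(f"FP16: ~{fp16_gb:.1f} GB, INT4: ~{int4_gb:.1f} GB")
```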

How It Works

AWQ employs an activation-aware approach, identifying and protecting salient weights that are crucial for model performance. It analyzes activation data to determine which weights are most sensitive to quantization errors, applying a more aggressive quantization strategy to less critical weights. This method achieves state-of-the-art accuracy for INT3/4 quantization, outperforming previous methods by preserving critical weights.
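The following is a minimal, illustrative PyTorch sketch of that idea: a per-input-channel scale derived from calibration activations, followed by ordinary group-wise INT4 quantization. The function name, the fixed alpha, and the scale handling are simplifications for exposition, not the repository's actual implementation (which grid-searches the scaling factor and fuses it into the preceding layer):

```python
import torch

def awq_style_quantize(w, act_mean_abs, alpha=0.5, n_bits=4, group_size=128):
    """Toy illustration of activation-aware weight quantization.

    w            : [out_features, in_features] FP16/FP32 weight matrix
    act_mean_abs : [in_features] mean |activation| per input channel,
                   collected from a small calibration set
    Assumes in_features is divisible by group_size.
    """
    # 1. Salient input channels (large activations) are scaled up before
    #    quantization, so they suffer less relative rounding error.
    s = act_mean_abs.clamp(min=1e-5) ** alpha            # [in_features]
    w_scaled = w * s                                     # broadcast over columns

    # 2. Plain group-wise symmetric quantization of the scaled weights.
    q_max = 2 ** (n_bits - 1) - 1                        # 7 for INT4
    out_f, in_f = w_scaled.shape
    wg = w_scaled.reshape(out_f, in_f // group_size, group_size)
    scale = (wg.abs().amax(dim=-1, keepdim=True) / q_max).clamp(min=1e-8)
    w_q = (wg / scale).round().clamp(-q_max - 1, q_max)

    # 3. Dequantize and undo the channel scaling (at runtime the 1/s factor
    #    is folded into the previous operator, so inference stays cheap).
    return (w_q * scale).reshape(out_f, in_f) / s
```

For example, awq_style_quantize(torch.randn(4096, 4096), torch.rand(4096)) returns a fake-quantized weight of the same shape that can be swapped into a layer for quick accuracy checks.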

Quick Start & Requirements

  • Install: Clone the repository and install via pip: pip install -e .
  • Prerequisites: Python 3.10+ and PyTorch (>=2.0.0 for edge devices, with NVIDIA's device-specific binaries on Jetson Orin). Building the CUDA kernels is required for optimized performance.
  • Model Zoo: Pre-computed AWQ search results are available via git clone https://huggingface.co/datasets/mit-han-lab/awq-model-zoo (a Python alternative is sketched after this list).
  • Docs: Paper, Website
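If you prefer not to shell out to git, the model zoo can also be fetched with huggingface_hub; a minimal sketch, assuming the huggingface_hub package is installed (the dataset's internal file layout is not shown here):

```python
# Alternative to the `git clone` above: download the pre-computed AWQ
# search results as a Hugging Face dataset snapshot.
from huggingface_hub import snapshot_download

zoo_path = snapshot_download(
    repo_id="mit-han-lab/awq-model-zoo",
    repo_type="dataset",  # published as a dataset repo, not a model repo
)
print("AWQ model zoo downloaded to:", zoo_path)
```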

Highlighted Details

  • Supports instruction-tuned and multi-modal LMs (e.g., Llama-3, VILA, LLaVA).
  • Offers memory-efficient 4-bit Linear layers in PyTorch with efficient CUDA kernels (see the packing sketch after this list).
  • Achieves state-of-the-art prefilling speed on edge devices via TinyChat.
  • Integrated into major platforms like Hugging Face Transformers, NVIDIA TensorRT-LLM, and Amazon SageMaker.
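To make the 4-bit Linear bullet concrete, here is a small, repository-agnostic sketch of how two INT4 weights can share one byte. The real kernels use their own packed layout and custom CUDA; the function names and layout below are assumptions for illustration only:

```python
import torch

def pack_int4(q):
    """Pack signed INT4 values (integers in [-8, 7]) into uint8, two per byte.
    Illustrative only; the repo's CUDA kernels use a different packed layout."""
    assert q.shape[-1] % 2 == 0
    u = (q + 8).to(torch.uint8)          # shift to the unsigned nibble range [0, 15]
    lo, hi = u[..., 0::2], u[..., 1::2]  # pair up adjacent columns
    return lo | (hi << 4)                # one byte now holds two 4-bit weights

def unpack_int4(packed):
    """Inverse of pack_int4: recover the signed INT4 values."""
    lo = (packed & 0xF).to(torch.int8) - 8
    hi = (packed >> 4).to(torch.int8) - 8
    out = torch.stack([lo, hi], dim=-1)
    return out.reshape(*packed.shape[:-1], packed.shape[-1] * 2)
```

Round-tripping an integer tensor with values in [-8, 7] through pack_int4/unpack_int4 recovers it exactly, while the packed form uses half the bytes of INT8 storage and a quarter of FP16.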

Maintenance & Community

The project is maintained by the MIT HAN Lab (mit-han-lab) at MIT and has seen significant adoption, including integration by Google Vertex AI and AMD. Updates are posted regularly; recent additions include Llama-3 and BF16 support.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Related projects and integrations suggest broad compatibility, but users should verify the license terms for their specific use case.

Limitations & Caveats

While AWQ supports a wide range of models, users may need to run the AWQ search themselves for models not included in the pre-computed model zoo. Installation on edge devices like NVIDIA Jetson Orin requires specific manual steps for PyTorch and CUDA kernel setup.

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 4
  • Issues (30d): 3

Star History

  • 216 stars in the last 90 days

Explore Similar Projects

  • nunchaku by nunchaku-tech — High-performance 4-bit diffusion model inference engine. 2.1%, 3k stars; created 8 months ago, updated 11 hours ago. Starred by Chip Huyen (author of AI Engineering, Designing Machine Learning Systems), Jaret Burkett (founder of Ostris), and 1 more.
  • GPTQ-for-LLaMa by qwopqwop200 — 4-bit quantization for LLaMA models using GPTQ. 0.0%, 3k stars; created 2 years ago, updated 1 year ago. Starred by Chip Huyen (author of AI Engineering, Designing Machine Learning Systems), Georgios Konstantopoulos (CTO, General Partner at Paradigm), and 2 more.
  • AutoGPTQ by AutoGPTQ — LLM quantization package using the GPTQ algorithm. 0.1%, 5k stars; created 2 years ago, updated 3 months ago. Starred by Chip Huyen (author of AI Engineering, Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 6 more.