auto-round by Intel

Quantization algorithm for LLMs and VLMs

created 1 year ago
559 stars

Top 58.3% on sourcepulse

View on GitHub
Project Summary

AutoRound is an advanced quantization algorithm designed to significantly reduce the memory footprint and computational cost of Large Language Models (LLMs) and Vision-Language Models (VLMs), enabling efficient inference across diverse hardware. It targets researchers and engineers seeking to deploy large models in resource-constrained environments while maintaining high accuracy, even at 2-bit precision.

How It Works

AutoRound employs a novel sign gradient descent method to fine-tune both rounding values and min-max clipping thresholds. The approach typically converges within 200 steps while achieving state-of-the-art accuracy. The algorithm supports mixed-bit tuning, LM-head quantization, and export to popular formats like GPTQ, AWQ, and GGUF, offering flexibility in deployment.
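The mechanics are easiest to see in a toy form. The following PyTorch sketch is illustrative only, not AutoRound's actual implementation: a learnable rounding offset and learnable clip scales are tuned with signed gradient descent to minimize a layer's output error, with a straight-through estimator letting gradients pass through the rounding step. All variable names and the per-tensor scheme are assumptions made for brevity (real schemes are typically per-group).

```python
import torch

def ste_round(t):
    # Straight-through estimator: round in the forward pass,
    # identity gradient in the backward pass.
    return t + (t.round() - t).detach()

def fake_quant(w, v, alpha, beta, bits=4):
    # Asymmetric min-max fake quantization with a learnable rounding
    # offset v and learnable clip scales alpha/beta (per-tensor here
    # purely for brevity).
    wmax, wmin = w.max() * alpha, w.min() * beta
    scale = (wmax - wmin) / (2 ** bits - 1)
    q = torch.clamp(ste_round((w - wmin) / scale + v), 0, 2 ** bits - 1)
    return q * scale + wmin

w = torch.randn(256, 256)                     # toy weight matrix
x = torch.randn(64, 256)                      # toy calibration activations
v = torch.zeros_like(w, requires_grad=True)   # rounding offset, kept in [-0.5, 0.5]
alpha = torch.ones((), requires_grad=True)    # clip scale for the max
beta = torch.ones((), requires_grad=True)     # clip scale for the min
lr = 5e-3

for _ in range(200):  # the README cites convergence within ~200 steps
    loss = (x @ fake_quant(w, v, alpha, beta).T - x @ w.T).pow(2).mean()
    loss.backward()
    with torch.no_grad():
        for p in (v, alpha, beta):
            p -= lr * p.grad.sign()  # signed gradient descent: step by the sign only
            p.grad = None
        v.clamp_(-0.5, 0.5)
```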

Quick Start & Requirements

  • Installation: pip install auto-round (GPU), pip install auto-round[cpu] (CPU), pip install auto-round-lib (HPU).
  • Prerequisites: Python 3.9+, PyTorch. CUDA or specific Intel extensions may be required for hardware acceleration.
  • Usage: Command-line interface (auto-round -h) or Python API (see the sketch after this list).
  • Documentation: step-by-step-instruction
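For orientation, here is a minimal end-to-end sketch of the Python API. The model name is a placeholder, and exact argument names or save methods may differ slightly between releases, so treat this as an outline rather than canonical usage.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "facebook/opt-125m"  # placeholder; substitute your target model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tune rounding for 4-bit symmetric weight quantization, then export.
autoround = AutoRound(model, tokenizer, bits=4, group_size=128, sym=True)
autoround.quantize()
autoround.save_quantized("./opt-125m-int4", format="auto_round")
```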

Highlighted Details

  • Achieves high accuracy even at 2-bit precision, with INT2-mixed R1 models retaining 97.9% of baseline accuracy.
  • Supports quantization for CPU, Intel GPU, CUDA, and HPU.
  • Integrated into Hugging Face Transformers (v4.51.3+); a loading sketch follows this list.
  • Offers export to AutoRound, AutoGPTQ, AutoAWQ, and experimental GGUF formats.
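Because the AutoRound format is understood by recent Transformers releases (v4.51.3+), a quantized checkpoint can be loaded back with the standard API when auto-round is installed. The path below is the hypothetical output directory from the quick-start sketch above:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load an AutoRound-quantized checkpoint like any other model.
model = AutoModelForCausalLM.from_pretrained("./opt-125m-int4", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("./opt-125m-int4")
```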

Maintenance & Community

  • Actively developed by Intel.
  • Integrated with major libraries like Hugging Face Transformers and PyTorch AO.
  • Community support channels are not explicitly mentioned in the README.

Licensing & Compatibility

  • License: Apache 2.0.
  • Compatibility: Permissive license allows for commercial use and integration with closed-source projects.

Limitations & Caveats

  • Experimental support for VLM non-text module quantization may have inference issues.
  • AutoGPTQ format has reported accuracy issues for asymmetric kernels, especially with 2-bit quantization and smaller models.
  • GGUF support is limited to q4_0 and q4_1 formats.

Health Check

  • Last commit: 1 day ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 55
  • Issues (30d): 26

Star History

118 stars in the last 90 days

Explore Similar Projects

nunchaku by nunchaku-tech
  2.1% · 3k stars · created 8 months ago · updated 20 hours ago
  High-performance 4-bit diffusion model inference engine
  Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jaret Burkett (Founder of Ostris), and 1 more.

AutoGPTQ by AutoGPTQ
  0.1% · 5k stars · created 2 years ago · updated 3 months ago
  LLM quantization package using the GPTQ algorithm
  Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 6 more.