auto-round by Intel

Quantization algorithm for LLMs and VLMs

created 1 year ago
559 stars

Top 58.3% on sourcepulse

View on GitHub
Project Summary

AutoRound is an advanced quantization algorithm designed to significantly reduce the memory footprint and computational cost of Large Language Models (LLMs) and Vision-Language Models (VLMs), enabling efficient inference across diverse hardware. It targets researchers and engineers seeking to deploy large models in resource-constrained environments while maintaining high accuracy, even at 2-bit precision.

How It Works

AutoRound employs a novel sign gradient descent method to fine-tune both rounding values and min-max clipping thresholds. The approach typically converges within 200 steps while achieving state-of-the-art accuracy. The algorithm supports mixed-bit tuning, LM-head quantization, and export to popular formats like GPTQ, AWQ, and GGUF, offering flexibility in deployment.
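The mechanics are easiest to see in a toy form. The following PyTorch sketch is illustrative only, not AutoRound's actual implementation: a learnable rounding offset and learnable clip scales are tuned with signed gradient descent to minimize a layer's output error, with a straight-through estimator letting gradients pass through the rounding step. All variable names and the per-tensor scheme are assumptions made for brevity (real schemes are typically per-group).

```python
import torch

def ste_round(t):
    # Straight-through estimator: round in the forward pass,
    # identity gradient in the backward pass.
    return t + (t.round() - t).detach()

def fake_quant(w, v, alpha, beta, bits=4):
    # Asymmetric min-max fake quantization with a learnable rounding
    # offset v and learnable clip scales alpha/beta (per-tensor here
    # purely for brevity).
    wmax, wmin = w.max() * alpha, w.min() * beta
    scale = (wmax - wmin) / (2 ** bits - 1)
    q = torch.clamp(ste_round((w - wmin) / scale + v), 0, 2 ** bits - 1)
    return q * scale + wmin

w = torch.randn(256, 256)                     # toy weight matrix
x = torch.randn(64, 256)                      # toy calibration activations
v = torch.zeros_like(w, requires_grad=True)   # rounding offset, kept in [-0.5, 0.5]
alpha = torch.ones((), requires_grad=True)    # clip scale for the max
beta = torch.ones((), requires_grad=True)     # clip scale for the min
lr = 5e-3

for _ in range(200):  # the README cites convergence within ~200 steps
    loss = (x @ fake_quant(w, v, alpha, beta).T - x @ w.T).pow(2).mean()
    loss.backward()
    with torch.no_grad():
        for p in (v, alpha, beta):
            p -= lr * p.grad.sign()  # signed gradient descent: step by the sign only
            p.grad = None
        v.clamp_(-0.5, 0.5)
```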

Quick Start & Requirements

  • Installation: pip install auto-round (GPU), pip install auto-round[cpu] (CPU), pip install auto-round-lib (HPU).
  • Prerequisites: Python 3.9+, PyTorch. CUDA or specific Intel extensions may be required for hardware acceleration.
  • Usage: Command-line interface (auto-round -h) or Python API (see the sketch after this list).
  • Documentation: step-by-step-instruction
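For orientation, here is a minimal end-to-end sketch of the Python API. The model name is a placeholder, and exact argument names or save methods may differ slightly between releases, so treat this as an outline rather than canonical usage.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "facebook/opt-125m"  # placeholder; substitute your target model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tune rounding for 4-bit symmetric weight quantization, then export.
autoround = AutoRound(model, tokenizer, bits=4, group_size=128, sym=True)
autoround.quantize()
autoround.save_quantized("./opt-125m-int4", format="auto_round")
```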

Highlighted Details

  • Achieves high accuracy even at 2-bit precision, with INT2-mixed R1 models retaining 97.9% of baseline accuracy.
  • Supports quantization for CPU, Intel GPU, CUDA, and HPU.
  • Integrated into Hugging Face Transformers (v4.51.3+); a loading sketch follows this list.
  • Offers export to AutoRound, AutoGPTQ, AutoAWQ, and experimental GGUF formats.
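Because the AutoRound format is understood by recent Transformers releases (v4.51.3+), a quantized checkpoint can be loaded back with the standard API when auto-round is installed. The path below is the hypothetical output directory from the quick-start sketch above:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load an AutoRound-quantized checkpoint like any other model.
model = AutoModelForCausalLM.from_pretrained("./opt-125m-int4", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("./opt-125m-int4")
```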

Maintenance & Community

  • Actively developed by Intel.
  • Integrated with major libraries like Hugging Face Transformers and PyTorch AO.
  • Community support channels are not explicitly mentioned in the README.

Licensing & Compatibility

  • License: Apache 2.0.
  • Compatibility: Permissive license allows for commercial use and integration with closed-source projects.

Limitations & Caveats

  • Experimental support for VLM non-text module quantization may have inference issues.
  • AutoGPTQ format has reported accuracy issues for asymmetric kernels, especially with 2-bit quantization and smaller models.
  • GGUF support is limited to q4_0 and q4_1 formats.

Health Check

  • Last commit: 1 day ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 55
  • Issues (30d): 26

Star History

118 stars in the last 90 days

Explore Similar Projects

nunchaku by nunchaku-tech
  2.1% · 3k stars · created 8 months ago · updated 20 hours ago
  High-performance 4-bit diffusion model inference engine
  Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jaret Burkett (Founder of Ostris), and 1 more.

AutoGPTQ by AutoGPTQ
  0.1% · 5k stars · created 2 years ago · updated 3 months ago
  LLM quantization package using the GPTQ algorithm
  Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 6 more.