auto-round by Intel

Quantization algorithm for LLMs and VLMs

Created 1 year ago · 634 stars · Top 52.3% on SourcePulse

Project Summary

AutoRound is an advanced quantization algorithm designed to significantly reduce the memory footprint and computational cost of Large Language Models (LLMs) and Vision-Language Models (VLMs), enabling efficient inference across diverse hardware. It targets researchers and engineers seeking to deploy large models on resource-constrained environments while maintaining high accuracy, even at 2-bit precision.

How It Works

AutoRound employs a novel sign gradient descent method to fine-tune both rounding values and min-max clipping thresholds. This approach allows for rapid convergence, typically within 200 steps, to achieve state-of-the-art accuracy. The algorithm supports mixed-bit tuning, LM-head quantization, and export to popular formats like GPTQ, AWQ, and GGUF, offering flexibility in deployment.
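
The following is a conceptual sketch of the core idea, not the project's actual implementation: a learnable offset v perturbs each weight's rounding decision and is tuned with the sign of its gradient through a straight-through estimator. All names are illustrative; the real algorithm also learns min-max clipping ranges and minimizes block-wise output error on calibration data.

    import torch
    import torch.nn.functional as F

    def ste_round(x):
        # round in the forward pass, identity gradient in the backward pass
        return (x.round() - x).detach() + x

    # toy per-tensor 4-bit signed quantization of a random weight matrix
    W = torch.randn(64, 64)
    scale = W.abs().max() / 7

    # v perturbs each rounding decision and is tuned by signed gradient descent
    v = torch.zeros_like(W, requires_grad=True)
    lr = 5e-3

    for _ in range(200):  # the README cites convergence in roughly 200 steps
        W_q = ste_round(W / scale + v).clamp(-8, 7) * scale
        loss = F.mse_loss(W_q, W)
        loss.backward()
        with torch.no_grad():
            v -= lr * v.grad.sign()   # sign gradient descent update
            v.clamp_(-0.5, 0.5)       # keep the offset within one rounding step
            v.grad.zero_()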

Quick Start & Requirements

  • Installation: pip install auto-round (GPU), pip install auto-round[cpu] (CPU), pip install auto-round-lib (HPU).
  • Prerequisites: Python 3.9+, PyTorch. CUDA or specific Intel extensions may be required for hardware acceleration.
  • Usage: Command-line interface (auto-round -h) or Python API; a minimal API sketch follows this list.
  • Documentation: step-by-step instructions linked from the repository README.
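
A minimal sketch of the Python API, following the usage pattern in the project README; argument names and defaults (bits, group_size, sym) may differ between releases, and the model name here is only an example:

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from auto_round import AutoRound

    model_name = "facebook/opt-125m"  # any causal LM; used here as an example
    model = AutoModelForCausalLM.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # 4-bit weights, group size 128, symmetric quantization
    autoround = AutoRound(model, tokenizer, bits=4, group_size=128, sym=True)
    autoround.quantize()

    # "auto_round" is the native format; GPTQ/AWQ/GGUF exports are also offered
    autoround.save_quantized("./opt-125m-int4", format="auto_round")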

Highlighted Details

  • Achieves high accuracy at 2-bit precision, with INT2-mixed R1 models retaining 97.9% accuracy.
  • Supports quantization for CPU, Intel GPU, CUDA, and HPU.
  • Integrated into Hugging Face Transformers (v4.51.3+); a loading sketch follows this list.
  • Offers export to AutoRound, AutoGPTQ, AutoAWQ, and experimental GGUF formats.
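
With the Transformers integration, a saved checkpoint loads like any other model. A sketch, reusing the hypothetical output directory from the quick-start example above (auto-round must be installed so Transformers can resolve the quantization config):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    quantized_dir = "./opt-125m-int4"  # directory from the quick-start sketch
    model = AutoModelForCausalLM.from_pretrained(quantized_dir, device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained(quantized_dir)

    inputs = tokenizer("Quantization reduces", return_tensors="pt").to(model.device)
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))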

Maintenance & Community

  • Actively developed by Intel.
  • Integrated with major libraries like Hugging Face Transformers and PyTorch AO.
  • Community support channels are not explicitly mentioned in the README.

Licensing & Compatibility

  • License: Apache 2.0.
  • Compatibility: Permissive license allows for commercial use and integration with closed-source projects.

Limitations & Caveats

  • Experimental support for VLM non-text module quantization may have inference issues.
  • AutoGPTQ format has reported accuracy issues for asymmetric kernels, especially with 2-bit quantization and smaller models.
  • GGUF support is limited to q4_0 and q4_1 formats.

Health Check

  • Last Commit: 15 hours ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 68
  • Issues (30d): 40
  • Star History: 45 stars in the last 30 days

Explore Similar Projects

Starred by Lysandre Debut (Chief Open-Source Officer at Hugging Face), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 4 more.

AQLM by Vahe1994
Top 0.4% · 1k stars
PyTorch code for LLM compression via Additive Quantization (AQLM)
Created 1 year ago · Updated 1 month ago
Starred by Yaowei Zheng (Author of LLaMA-Factory), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 7 more.

llm-awq by mit-han-lab
Top 0.3% · 3k stars
Weight quantization research paper for LLM compression/acceleration
Created 2 years ago · Updated 2 months ago
Starred by Tobi Lutke (Cofounder of Shopify), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 22 more.

qlora by artidoro
Top 0.1% · 11k stars
Finetuning tool for quantized LLMs
Created 2 years ago · Updated 1 year ago