optimum-quanto by huggingface

PyTorch quantization backend for Hugging Face models

created 1 year ago
975 stars

Top 38.7% on sourcepulse

Project Summary

Optimum Quanto provides a PyTorch quantization backend for Hugging Face's Optimum library, enabling efficient model deployment through weight and activation quantization. It targets researchers and engineers working with large language models and diffusion models, offering simplified workflows for converting float models to quantized versions with minimal accuracy loss and significant memory reduction.

How It Works

Quanto introduces a Tensor subclass that projects source tensor values into the optimal range of a target data type, minimizing both saturation and the zeroing of small values. For integer types the projection rounds; for float types it uses native PyTorch casting. The projection is symmetric per-tensor or per-channel for int8 and float8, and group-wise affine (scale plus shift) for lower bitwidths such as int2 and int4. Quanto replaces standard PyTorch modules with quantized versions whose weights are converted dynamically until the model is "frozen", which enables quantization-aware training. Weights are typically quantized per-channel, while biases are kept in float. Activations are quantized per-tensor with static scales, with an optional calibration pass to determine those scales. A minimal sketch of the symmetric projection follows.
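
The sketch below is not Quanto's internal code; it is a plain-PyTorch illustration of symmetric int8 quantization with either a per-tensor or a per-channel scale, the scheme described above.

    import torch

    def quantize_symmetric_int8(t: torch.Tensor, dim=None):
        # Symmetric projection: choose a scale so the largest magnitude maps to 127.
        amax = t.abs().amax() if dim is None else t.abs().amax(dim=dim, keepdim=True)
        scale = amax / 127
        q = torch.clamp(torch.round(t / scale), -127, 127).to(torch.int8)
        return q, scale

    w = torch.randn(64, 128)
    q_t, s_t = quantize_symmetric_int8(w)          # per-tensor: a single scale
    q_c, s_c = quantize_symmetric_int8(w, dim=1)   # per-channel: one scale per output row
    w_hat = q_c.float() * s_c                      # dequantize
    print((w - w_hat).abs().max())                 # small rounding error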

Quick Start & Requirements

  • Install via pip: pip install optimum-quanto
  • Requires PyTorch. CUDA is recommended for accelerated matrix multiplications.
  • Official documentation and examples are linked from the repository README; a minimal usage sketch follows this list.
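
A minimal sketch of the documented quantize → calibrate → freeze flow. The model name and the dummy input are illustrative assumptions, not part of the library.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from optimum.quanto import Calibration, freeze, qint8, quantize

    model = AutoModelForCausalLM.from_pretrained("gpt2")   # illustrative model choice
    tokenizer = AutoTokenizer.from_pretrained("gpt2")

    # Replace supported modules with quantized versions (int8 weights and activations).
    quantize(model, weights=qint8, activations=qint8)

    # Optional calibration pass to record static activation scales.
    with torch.no_grad(), Calibration():
        model(**tokenizer("Hello, world!", return_tensors="pt"))

    # Freezing converts the float weights to their quantized representation.
    freeze(model)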

Highlighted Details

  • Supports int2, int4, int8, and float8 weights, and int8/float8 activations.
  • Offers accelerated matrix multiplications on CUDA for various mixed precision combinations (e.g., int8-int8, fp16-int4).
  • Models quantized with int8/float8 weights and float8 activations show accuracy close to full-precision models.
  • Memory usage drops roughly in proportion to the reduction in weight bitwidth; a short sketch of selecting weight and activation types follows this list.
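
As a hedged illustration of mixing the supported types, the same quantize call can take different weight and activation dtypes; qint4 and qfloat8 are the library's documented type aliases, and model is assumed to be the float model from the earlier sketch.

    from optimum.quanto import freeze, qfloat8, qint4, quantize

    # int4 weights (group-wise affine) with float8 activations, one of the
    # supported mixed-precision combinations.
    quantize(model, weights=qint4, activations=qfloat8)
    freeze(model)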

Maintenance & Community

  • Developed by Hugging Face.
  • Links to official documentation and examples are provided within the README.

Licensing & Compatibility

  • License is not explicitly stated in the README, but as part of Hugging Face's ecosystem, it is likely Apache 2.0 or similar, generally permissive for commercial use.

Limitations & Caveats

  • Some features are still under development, such as dynamic activation smoothing and optimized kernels for every mixed-precision matrix multiplication on every device.
  • Compatibility with torch.compile (Dynamo) is not yet implemented.
  • Quantizing activations per-tensor to int8 can introduce significant errors when outlier values inflate the single scale; float8 activations or external smoothing techniques can mitigate this (see the sketch after this list).
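
The sketch below (plain PyTorch, not Quanto code) shows why a single per-tensor int8 scale struggles with outliers: one large value stretches the scale and coarsens the quantization grid for every other value.

    import torch

    def per_tensor_int8(t: torch.Tensor) -> torch.Tensor:
        # Per-tensor symmetric int8: a single scale covers the whole tensor.
        scale = t.abs().max() / 127
        return torch.clamp(torch.round(t / scale), -127, 127) * scale

    x = torch.randn(1024)
    x_out = x.clone()
    x_out[0] = 100.0   # a single outlier stretches the scale by roughly 30x

    print((x - per_tensor_int8(x)).abs().mean())          # small rounding error
    print((x_out - per_tensor_int8(x_out)).abs().mean())  # much larger error on the other values
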
Health Check

  • Last commit: 4 weeks ago
  • Responsiveness: 1 day
  • Pull requests (30d): 2
  • Issues (30d): 3
  • Star history: 54 stars in the last 90 days

Explore Similar Projects

Starred by Lysandre Debut (Chief Open-Source Officer at Hugging Face), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 1 more.

AQLM by Vahe1994

Top 0.1% on sourcepulse · 1k stars
PyTorch code for LLM compression via Additive Quantization (AQLM)
created 1 year ago
updated 2 months ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Georgios Konstantopoulos (CTO, General Partner at Paradigm), and 2 more.

GPTQ-for-LLaMa by qwopqwop200

Top 0.0% on sourcepulse · 3k stars
4-bit quantization for LLaMA models using GPTQ
created 2 years ago
updated 1 year ago