optimum-quanto by huggingface

PyTorch quantization backend for Hugging Face models

created 1 year ago
975 stars

Top 38.7% on sourcepulse

Project Summary

Optimum Quanto provides a PyTorch quantization backend for Hugging Face's Optimum library, enabling efficient model deployment through weight and activation quantization. It targets researchers and engineers working with large language models and diffusion models, offering simplified workflows for converting float models to quantized versions with minimal accuracy loss and significant memory reduction.

How It Works

Quanto introduces a Tensor subclass that projects source tensor values into the optimal range of a target data type, minimizing both saturation and the zeroing of small values. For integer types the projection rounds; for float types it uses native PyTorch casting. The projection is symmetric per-tensor or per-channel for int8 and float8, and group-wise affine (scale plus shift) for lower bitwidths such as int2 and int4. Quanto replaces standard PyTorch modules with quantized versions whose weights are converted dynamically until the model is "frozen", which enables quantization-aware training. Weights are typically quantized per-channel, while biases are kept in float. Activations are quantized per-tensor with static scales, with an optional calibration pass to determine those scales. A minimal sketch of the symmetric projection follows.
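
The sketch below is not Quanto's internal code; it is a plain-PyTorch illustration of symmetric int8 quantization with either a per-tensor or a per-channel scale, the scheme described above.

    import torch

    def quantize_symmetric_int8(t: torch.Tensor, dim=None):
        # Symmetric projection: choose a scale so the largest magnitude maps to 127.
        amax = t.abs().amax() if dim is None else t.abs().amax(dim=dim, keepdim=True)
        scale = amax / 127
        q = torch.clamp(torch.round(t / scale), -127, 127).to(torch.int8)
        return q, scale

    w = torch.randn(64, 128)
    q_t, s_t = quantize_symmetric_int8(w)          # per-tensor: a single scale
    q_c, s_c = quantize_symmetric_int8(w, dim=1)   # per-channel: one scale per output row
    w_hat = q_c.float() * s_c                      # dequantize
    print((w - w_hat).abs().max())                 # small rounding error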

Quick Start & Requirements

  • Install via pip: pip install optimum-quanto
  • Requires PyTorch. CUDA is recommended for accelerated matrix multiplications.
  • Official documentation and examples are linked from the repository README; a minimal usage sketch follows this list.
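
A minimal sketch of the documented quantize → calibrate → freeze flow. The model name and the dummy input are illustrative assumptions, not part of the library.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from optimum.quanto import Calibration, freeze, qint8, quantize

    model = AutoModelForCausalLM.from_pretrained("gpt2")   # illustrative model choice
    tokenizer = AutoTokenizer.from_pretrained("gpt2")

    # Replace supported modules with quantized versions (int8 weights and activations).
    quantize(model, weights=qint8, activations=qint8)

    # Optional calibration pass to record static activation scales.
    with torch.no_grad(), Calibration():
        model(**tokenizer("Hello, world!", return_tensors="pt"))

    # Freezing converts the float weights to their quantized representation.
    freeze(model)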

Highlighted Details

  • Supports int2, int4, int8, and float8 weights, and int8/float8 activations.
  • Offers accelerated matrix multiplications on CUDA for various mixed precision combinations (e.g., int8-int8, fp16-int4).
  • Models quantized with int8/float8 weights and float8 activations show accuracy close to full-precision models.
  • Memory usage drops roughly in proportion to the reduction in weight bitwidth; a short sketch of selecting weight and activation types follows this list.
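
As a hedged illustration of mixing the supported types, the same quantize call can take different weight and activation dtypes; qint4 and qfloat8 are the library's documented type aliases, and model is assumed to be the float model from the earlier sketch.

    from optimum.quanto import freeze, qfloat8, qint4, quantize

    # int4 weights (group-wise affine) with float8 activations, one of the
    # supported mixed-precision combinations.
    quantize(model, weights=qint4, activations=qfloat8)
    freeze(model)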

Maintenance & Community

  • Developed by Hugging Face.
  • Links to official documentation and examples are provided within the README.

Licensing & Compatibility

  • License is not explicitly stated in the README, but as part of Hugging Face's ecosystem, it is likely Apache 2.0 or similar, generally permissive for commercial use.

Limitations & Caveats

  • Some features are still under development, such as dynamic activation smoothing and optimized kernels for every mixed-precision matrix multiplication on every device.
  • Compatibility with torch.compile (Dynamo) is not yet implemented.
  • Quantizing activations per-tensor to int8 can introduce significant errors when outlier values inflate the single scale; float8 activations or external smoothing techniques can mitigate this (see the sketch after this list).
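
The sketch below (plain PyTorch, not Quanto code) shows why a single per-tensor int8 scale struggles with outliers: one large value stretches the scale and coarsens the quantization grid for every other value.

    import torch

    def per_tensor_int8(t: torch.Tensor) -> torch.Tensor:
        # Per-tensor symmetric int8: a single scale covers the whole tensor.
        scale = t.abs().max() / 127
        return torch.clamp(torch.round(t / scale), -127, 127) * scale

    x = torch.randn(1024)
    x_out = x.clone()
    x_out[0] = 100.0   # a single outlier stretches the scale by roughly 30x

    print((x - per_tensor_int8(x)).abs().mean())          # small rounding error
    print((x_out - per_tensor_int8(x_out)).abs().mean())  # much larger error on the other values
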
Health Check

  • Last commit: 4 weeks ago
  • Responsiveness: 1 day
  • Pull requests (30d): 2
  • Issues (30d): 3
  • Star history: 54 stars in the last 90 days

Explore Similar Projects

Starred by Lysandre Debut (Chief Open-Source Officer at Hugging Face), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 1 more.

AQLM by Vahe1994

Top 0.1% on sourcepulse · 1k stars
PyTorch code for LLM compression via Additive Quantization (AQLM)
created 1 year ago
updated 2 months ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Georgios Konstantopoulos (CTO, General Partner at Paradigm), and 2 more.

GPTQ-for-LLaMa by qwopqwop200

Top 0.0% on sourcepulse · 3k stars
4-bit quantization for LLaMA models using GPTQ
created 2 years ago
updated 1 year ago