PyTorch quantization backend for Hugging Face models
Optimum Quanto provides a PyTorch quantization backend for Hugging Face's Optimum library, enabling efficient model deployment through weight and activation quantization. It targets researchers and engineers working with large language models and diffusion models, offering simplified workflows for converting float models to quantized versions with minimal accuracy loss and significant memory reduction.
How It Works
Quanto introduces a Tensor subclass that projects source tensor values into an optimal range for a target data type, minimizing saturation and zeroing. For integer types, this projection involves rounding; for float types, it uses native PyTorch casting. The projection is symmetric per-tensor or per-channel for int8/float8, and group-wise affine for lower bitwidths. Quanto replaces standard PyTorch modules with quantized versions that dynamically quantize their weights until the model is "frozen", which allows quantization-aware training. Weights are typically quantized per-channel, while biases are kept as floats. Activations are quantized per-tensor using static scales, with optional calibration to determine the optimal scales.
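To make the projection idea concrete, here is a minimal, self-contained sketch of symmetric per-tensor int8 quantization in plain PyTorch. It is illustrative only and does not use Quanto's own Tensor subclass; the function names are hypothetical, and it omits the per-channel, group-wise affine, and float8 cases described above.

```python
import torch

def symmetric_int8_quantize(t: torch.Tensor):
    # Pick a single per-tensor scale so the largest absolute value maps to the
    # int8 limit, which keeps saturation to a minimum for a symmetric range.
    scale = t.abs().max().clamp(min=1e-8) / 127.0
    # Integer targets are reached by rounding, then clamping to the valid range.
    q = torch.round(t / scale).clamp(-127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Recover an approximation of the original float values.
    return q.to(torch.float32) * scale

x = torch.randn(4, 8)
q, scale = symmetric_int8_quantize(x)
x_hat = dequantize(q, scale)
print((x - x_hat).abs().max())  # small per-tensor quantization error
```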
Quick Start & Requirements
pip install optimum-quanto
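After installation, the typical workflow is quantize, optionally calibrate, then freeze. The sketch below assumes the quantize/freeze/Calibration API and qint8 type exposed by optimum.quanto as described in the project documentation; the Transformers model name and the random calibration input are purely illustrative.

```python
import torch
from transformers import AutoModelForCausalLM
from optimum.quanto import quantize, freeze, qint8, Calibration

# Illustrative model choice; any float PyTorch model can be quantized.
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

# Replace eligible modules with quantized versions (int8 weights and activations).
quantize(model, weights=qint8, activations=qint8)

# Optionally run representative samples under Calibration() so that static
# activation scales are recorded (random tokens used here as a placeholder).
with torch.no_grad(), Calibration():
    model(torch.randint(0, model.config.vocab_size, (1, 32)))

# Freeze the model to convert weights to their quantized form for inference.
freeze(model)
```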
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
Support for torch.compile (Dynamo) is not yet implemented.