huggingface/optimum-quanto: PyTorch quantization backend for Hugging Face models
Top 37.2% on SourcePulse
Optimum Quanto provides a PyTorch quantization backend for Hugging Face's Optimum library, enabling efficient model deployment through weight and activation quantization. It targets researchers and engineers working with large language models and diffusion models, offering simplified workflows for converting float models to quantized versions with minimal accuracy loss and significant memory reduction.
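As an illustration of that high-level workflow, here is a minimal sketch, assuming the QuantizedModelForCausalLM helper shipped in recent optimum-quanto releases and using facebook/opt-125m as a stand-in checkpoint:

from transformers import AutoModelForCausalLM
from optimum.quanto import QuantizedModelForCausalLM, qint4

# Load a float model from the Hugging Face Hub (small stand-in checkpoint).
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

# Quantize weights to 4-bit, keeping the lm_head in full precision.
qmodel = QuantizedModelForCausalLM.quantize(model, weights=qint4, exclude="lm_head")

# Save the quantized model so it can be reloaded later without re-quantizing.
qmodel.save_pretrained("./opt-125m-quantized")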
How It Works
Quanto introduces a Tensor subclass that projects source tensor values into an optimal range for a target data type, minimizing the number of values that saturate or are zeroed out. For integer types this involves rounding; for floats it uses native PyTorch casting. The projection is symmetric per-tensor or per-channel for int8/float8, and group-wise affine for lower bitwidths. Quanto replaces standard PyTorch modules with quantized versions that dynamically convert weights until a model is "frozen," allowing for quantization-aware training. Weights are typically quantized per-channel, while biases are preserved as floats. Activations are quantized per-tensor using static scales, with optional calibration to determine optimal scales.
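A minimal sketch of that flow (quantize, calibrate, freeze) on a toy module, assuming int8 for both weights and activations and random tensors standing in for real calibration data:

import torch
from optimum.quanto import Calibration, freeze, qint8, quantize

# Toy float model standing in for a real transformers or diffusers model.
model = torch.nn.Sequential(
    torch.nn.Linear(16, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 4),
)

# Swap supported modules (here the Linear layers) for quantized versions;
# weights keep being converted dynamically until the model is frozen.
quantize(model, weights=qint8, activations=qint8)

# Calibration pass: record activation ranges to set the static per-tensor scales.
with Calibration(), torch.no_grad():
    for _ in range(8):
        model(torch.randn(4, 16))

# Freeze: weights are converted to their quantized representation once and for all.
freeze(model)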
Quick Start & Requirements
pip install optimum-quanto
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
torch.compile (Dynamo) is not yet implemented.