Sparsebit by megvii-research

Model compression and acceleration toolbox

created 3 years ago
331 stars

Top 83.8% on sourcepulse

Project Summary

Sparsebit is a PyTorch-based toolkit for model compression and acceleration, offering pruning and quantization capabilities. It targets researchers and engineers seeking to reduce model size and inference latency with minimal code changes, supporting both Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT).

How It Works

Sparsebit leverages torch.fx to transform PyTorch models into a QuantModel whose operations become QuantModules. This modular design makes it straightforward to extend quantization methods, observers, and modules. For pruning, it supports structured and unstructured pruning across model components (weights, activations, layers) using algorithms such as L1/L0-norm, Fisher pruning, and HRank, and can export pruned models to ONNX.
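To make the torch.fx mechanism concrete, here is a minimal sketch using plain torch.fx (not Sparsebit's internal code): tracing captures the forward pass as a graph of nodes, and each node is a rewrite point where a tool like Sparsebit can substitute a quantized module.

```python
import torch.fx as fx
import torch.nn as nn

# A small model to demonstrate graph capture.
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.conv(x))

# torch.fx records the forward pass as a graph of operations.
traced = fx.symbolic_trace(TinyNet())

# Each node (placeholder, call_module, output, ...) is a point where
# a quantized replacement module could be swapped in.
for node in traced.graph.nodes:
    print(node.op, node.target)
```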

Quick Start & Requirements

  • Install via pip: pip install sparsebit (a minimal PTQ sketch follows this list)
  • Requires PyTorch. No specific CUDA version or GPU is mandated for basic functionality, but features such as the GPTQ CUDA kernels require a CUDA-capable GPU.
  • Documentation: docs
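As referenced above, a minimal PTQ sketch built on the QuantModel and parse_qconfig entry points from the project's documentation. The YAML path is a placeholder, and the calibration method names follow the repo's examples; treat them as assumptions to verify against your installed version.

```python
import torch
import torchvision.models as models
from sparsebit.quantization import QuantModel, parse_qconfig

# Float model plus a YAML quantization config (path is a placeholder;
# load pretrained weights in practice).
model = models.resnet18(weights=None).eval()
qconfig = parse_qconfig("qconfig.yaml")

# Sparsebit traces the model with torch.fx and inserts QuantModules.
qmodel = QuantModel(model, qconfig).eval()

# Calibrate on a few batches (random data here only for shape; use real
# calibration data in practice). Method names follow the project's PTQ
# examples -- verify them against your version.
calib_batches = [torch.randn(1, 3, 224, 224) for _ in range(8)]
qmodel.prepare_calibration()
with torch.no_grad():
    for batch in calib_batches:
        qmodel(batch)
qmodel.calc_qparams()
qmodel.set_quant(w_quant=True, a_quant=True)
```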

Highlighted Details

  • Provides GPTQ CUDA kernels with group-size support for efficient low-bit quantization.
  • Enables fine-tuning of large models such as LLaMA-65B with pipeline parallelism on consumer hardware (e.g., 8x NVIDIA 2080 Ti GPUs).
  • Offers PTQ and QAT examples for various architectures including LLaMA, BERT, and vision models (BEVDet, BEVDepth, ViT).
  • Supports exporting QDQ-ONNX for deployment with TensorRT and ONNXRuntime.
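Continuing the PTQ sketch above (qmodel is the calibrated QuantModel), the following shows the QDQ-ONNX deployment path with a quick ONNXRuntime sanity check. The export_onnx call mirrors the method name used in the project's examples (an assumption to verify); the onnxruntime calls are that library's standard API.

```python
import onnxruntime as ort
import torch

# Export the calibrated QuantModel to QDQ-ONNX (method name per the
# project's examples; verify against your installed version).
dummy = torch.randn(1, 3, 224, 224)
qmodel.export_onnx(dummy, name="qmodel.onnx")

# Load the exported graph with ONNXRuntime and run one inference.
sess = ort.InferenceSession("qmodel.onnx", providers=["CPUExecutionProvider"])
outputs = sess.run(None, {sess.get_inputs()[0].name: dummy.numpy()})
print(outputs[0].shape)
```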

Maintenance & Community

The project is from megvii-research, with its most recent updates in April 2023. It credits several open-source projects as inspiration. Contact: sunpeiqin@megvii.com for team opportunities.

Licensing & Compatibility

Released under the Apache 2.0 license, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

While flexible, the README focuses heavily on specific model architectures and quantization techniques (e.g., GPTQ, QAT). Broader model compatibility and performance benchmarks beyond those listed may require user validation.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 1 star in the last 90 days

Explore Similar Projects

Starred by Lysandre Debut (Chief Open-Source Officer at Hugging Face), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 1 more.

AQLM by Vahe1994

0.1% · 1k stars
PyTorch code for LLM compression via Additive Quantization (AQLM)
Created 1 year ago · updated 2 months ago

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Georgios Konstantopoulos (CTO, General Partner at Paradigm), and 2 more.

GPTQ-for-LLaMa by qwopqwop200

0.0% · 3k stars
4-bit quantization for LLaMA models using GPTQ
Created 2 years ago · updated 1 year ago

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 6 more.

AutoGPTQ by AutoGPTQ

0.1% · 5k stars
LLM quantization package using GPTQ algorithm
Created 2 years ago · updated 3 months ago