Sparsebit by megvii-research

Model compression and acceleration toolbox

Created 3 years ago · 332 stars

Top 82.4% on SourcePulse

Project Summary

Sparsebit is a PyTorch-based toolkit for model compression and acceleration, offering pruning and quantization capabilities. It targets researchers and engineers seeking to reduce model size and inference latency with minimal code changes, supporting both Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT).

How It Works

Sparsebit uses torch.fx to trace a PyTorch model into a QuantModel whose operations are rewritten as QuantModules. This modular design makes it straightforward to extend quantization methods, observers, and modules. For pruning, it supports structured and unstructured pruning of weights, activations, and layers using criteria such as the L1/L0 norm, Fisher pruning, and HRank, and pruned models can be exported to ONNX.
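The module-swapping mechanism behind this design can be illustrated with plain torch.fx. The sketch below is illustrative only, not Sparsebit's code: QuantConv2d and its fixed scale are hypothetical stand-ins for Sparsebit's QuantModules and calibrated observers.

```python
import torch
import torch.fx as fx
import torch.nn as nn

class QuantConv2d(nn.Module):
    """Hypothetical wrapper that fake-quantizes the input before a conv."""
    def __init__(self, conv: nn.Conv2d):
        super().__init__()
        self.conv = conv
        self.scale = 0.1  # a real toolkit would calibrate this with an observer

    def forward(self, x):
        x = torch.fake_quantize_per_tensor_affine(x, self.scale, 0, -128, 127)
        return self.conv(x)

def swap_convs(model: nn.Module) -> fx.GraphModule:
    gm = fx.symbolic_trace(model)        # trace the model into a graph IR
    modules = dict(gm.named_modules())
    for node in gm.graph.nodes:
        if node.op == "call_module" and isinstance(modules[node.target], nn.Conv2d):
            # Replace the traced conv with its quantized wrapper, in place.
            parent_name, _, attr = node.target.rpartition(".")
            parent = modules[parent_name] if parent_name else gm
            setattr(parent, attr, QuantConv2d(modules[node.target]))
    gm.recompile()
    return gm
```

Calling swap_convs on, say, a torchvision ResNet returns a GraphModule whose convolutions now run behind fake-quantization; a QuantModel applies the same rewrite-by-tracing pattern across all supported ops.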

Quick Start & Requirements

  • Install via pip: pip install sparsebit (a minimal PTQ sketch follows this list)
  • Requires PyTorch. No specific CUDA version or GPU is explicitly mandated for basic use, though GPU-bound features such as the GPTQ CUDA kernels need a CUDA-capable device.
  • Documentation: docs
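For orientation, here is a minimal PTQ sketch following the workflow the README and docs describe. The names parse_qconfig, QuantModel, prepare_calibration, calc_qparams, and set_quant reflect our reading of the project's examples; treat them as assumptions to verify against the installed version.

```python
import torch
import torchvision
# Assumed import path, per the project's README examples.
from sparsebit.quantization import QuantModel, parse_qconfig

model = torchvision.models.resnet18().eval()
qconfig = parse_qconfig("qconfig.yaml")   # backend, bit widths, observers, ...
qmodel = QuantModel(model, qconfig)       # traces the model with torch.fx

# Calibrate on a small, representative set of batches (random here).
calib_data = [torch.randn(8, 3, 224, 224) for _ in range(4)]
qmodel.prepare_calibration()              # attach observers (assumed API)
with torch.no_grad():
    for batch in calib_data:
        qmodel(batch)
qmodel.calc_qparams()                     # observer stats -> scales/zero-points
qmodel.set_quant(w_quant=True, a_quant=True)  # enable fake-quant for evaluation
```

From here, the project's examples export the quantized graph to QDQ-ONNX for deployment, as highlighted below.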

Highlighted Details

  • Provides GPTQ CUDA kernels with group-size support for efficient quantization.
  • Enables fine-tuning large models such as LLaMA-65B with pipeline parallelism on consumer hardware (e.g., 8x 2080 Ti GPUs).
  • Offers PTQ and QAT examples for various architectures including LLaMA, BERT, and vision models (BEVDet, BEVDepth, ViT).
  • Supports exporting QDQ-ONNX for deployment with TensorRT and ONNXRuntime (see the inference sketch below).
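As a deployment illustration, the snippet below runs an exported QDQ-ONNX file with ONNXRuntime. The file name qmodel.onnx and the input shape are placeholders, not values Sparsebit fixes.

```python
import numpy as np
import onnxruntime as ort

# Load a QDQ-ONNX model exported from Sparsebit ("qmodel.onnx" is a placeholder).
sess = ort.InferenceSession("qmodel.onnx", providers=["CPUExecutionProvider"])
input_name = sess.get_inputs()[0].name
dummy = np.random.randn(1, 3, 224, 224).astype(np.float32)  # assumed input shape
outputs = sess.run(None, {input_name: dummy})
print(outputs[0].shape)
```

The same QDQ-ONNX file can also be consumed by TensorRT for GPU deployment.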

Maintenance & Community

The project comes from megvii-research; the most recent updates landed in April 2023. The README credits several open-source projects that inspired it. Contact sunpeiqin@megvii.com for team opportunities.

Licensing & Compatibility

Released under the Apache 2.0 license, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

While flexible, the README focuses on specific model architectures and quantization techniques (e.g., GPTQ, QAT); compatibility and performance for models beyond those listed may require user validation.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days

Explore Similar Projects

Starred by Jeremy Howard (Cofounder of fast.ai), Sasha Rush (Research Scientist at Cursor; Professor at Cornell Tech), and 1 more.

GPTQ-triton by fpgaminer · 307 stars (0%)
Triton kernel for GPTQ inference, improving context scaling
Created 2 years ago · Updated 2 years ago
Starred by Lysandre Debut (Chief Open-Source Officer at Hugging Face), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 4 more.

AQLM by Vahe1994 · 1k stars (0.4%)
PyTorch code for LLM compression via Additive Quantization (AQLM)
Created 1 year ago · Updated 1 month ago
Starred by Yaowei Zheng (Author of LLaMA-Factory), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 7 more.

llm-awq by mit-han-lab · 3k stars (0.3%)
Weight quantization research paper for LLM compression/acceleration
Created 2 years ago · Updated 2 months ago
Starred by Junyang Lin (Core Maintainer at Alibaba Qwen), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 3 more.

neural-compressor by intel · 2k stars (0.2%)
Python library for model compression (quantization, pruning, distillation, NAS)
Created 5 years ago · Updated 16 hours ago