neural-compressor by intel

Python library for model compression (quantization, pruning, distillation, NAS)

created 5 years ago
2,464 stars

Top 19.3% on sourcepulse

Project Summary

Intel® Neural Compressor is an open-source Python library offering state-of-the-art model compression techniques such as low-bit quantization (INT8, FP8, INT4, FP4, NF4) and sparsity. It targets researchers and engineers who need to optimize deep learning models for inference on various hardware, particularly Intel platforms, by reducing model size and accelerating execution.

How It Works

The library supports quantization, pruning, distillation, and neural architecture search across TensorFlow, PyTorch, and ONNX Runtime. It employs accuracy-driven, automatic quantization strategies, including dynamic, static, smooth, and weight-only quantization, to minimize accuracy loss while maximizing performance gains. The recent 3.x API introduces a Transformers-like interface for INT4 inference.
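
In the 3.x PyTorch path, quantization is typically a two-step prepare/convert flow. Below is a minimal weight-only sketch, assuming the neural_compressor.torch.quantization entry points (RTNConfig, prepare, convert) from recent 3.x releases; the toy model and the bits=4 setting are illustrative, not taken from the README.

    # Minimal 3.x-style weight-only quantization sketch (assumptions noted above).
    import torch
    from neural_compressor.torch.quantization import RTNConfig, prepare, convert

    # Any torch.nn.Module with Linear layers works; this toy MLP is illustrative.
    model = torch.nn.Sequential(
        torch.nn.Linear(64, 64),
        torch.nn.ReLU(),
        torch.nn.Linear(64, 8),
    ).eval()

    # RTN (round-to-nearest) is one of the weight-only schemes; bits=4 requests
    # INT4 weights. Other configs cover static, dynamic, and smooth quantization.
    quant_config = RTNConfig(bits=4)

    # prepare() stages the quantization settings; convert() applies them.
    model = prepare(model, quant_config)
    model = convert(model)

    with torch.no_grad():
        print(model(torch.randn(1, 64)).shape)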

Quick Start & Requirements

  • Installation: pip install neural-compressor[pt] (for PyTorch) or pip install neural-compressor[tf] (for TensorFlow); a post-install sanity check follows this list.
  • Prerequisites: Python 3.8+ and the matching framework stack (e.g., intel_extension_for_pytorch). Docker images are recommended for Intel Gaudi AI Accelerators.
  • Resources: Setup time varies; Gaudi examples require specific Docker images and environment setup.
  • Documentation: Official Documentation, LLM Recipes, Validated Models.
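
After installing, a quick import check confirms the package is usable; the version string shown is simply what the README reports for the latest release.

    # Post-install sanity check; works for either the [pt] or [tf] extra.
    import neural_compressor
    print(neural_compressor.__version__)  # e.g. 3.3.1 per the README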

Highlighted Details

  • Supports a wide range of Intel hardware (Gaudi, Core Ultra, Xeon, Data Center GPUs), with limited testing on AMD CPUs, ARM CPUs, and NVIDIA GPUs via ONNX Runtime.
  • Validated on numerous LLMs (Llama2, Falcon, GPT-J) and a broad set of other models (Stable Diffusion, BERT-Large, ResNet50) from hubs such as Hugging Face and TorchVision.
  • Integrates with cloud marketplaces (GCP, AWS, Azure) and AI ecosystems (Hugging Face, PyTorch, ONNX Runtime, Microsoft Olive).
  • Features a Transformers-like API for INT4 inference on Intel CPUs and GPUs (see the sketch below).
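
The Transformers-like API makes INT4 weight-only inference close to a drop-in model load. A hedged sketch, assuming neural_compressor.transformers.AutoModelForCausalLM accepts a load_in_4bit flag as in the project's recent 3.x examples; the hub model id is illustrative.

    # INT4 weight-only inference via the Transformers-like 3.x API
    # (see assumptions in the lead-in above).
    from transformers import AutoTokenizer
    from neural_compressor.transformers import AutoModelForCausalLM

    model_name = "facebook/opt-125m"  # illustrative; any causal-LM hub id
    # load_in_4bit requests weight-only INT4 quantization at load time.
    model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    inputs = tokenizer("Hello, my name is", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=16)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))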

Maintenance & Community

  • Actively maintained with regular releases (3.3.1 as of the README).
  • Community engagement via GitHub Issues, email (inc.maintainers@intel.com), and a Discord Channel.

Licensing & Compatibility

  • Licensed under Apache 2.0.
  • Compatible with commercial use and closed-source linking.

Limitations & Caveats

  • Compression during training (quantization-aware training, pruning, distillation) is currently available only through the older 2.x API; a hedged 2.x sketch follows below.
  • Testing on non-Intel hardware (AMD CPUs, ARM CPUs, NVIDIA GPUs via ONNX Runtime) is limited.
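
For reference, training-time compression in the 2.x API is driven by a compression manager with training callbacks. A hedged sketch of quantization-aware training, assuming the prepare_compression/QuantizationAwareTrainingConfig pattern from the project's 2.x documentation; the toy model and training loop are illustrative.

    # 2.x-style quantization-aware training sketch (assumptions noted above).
    import torch
    from neural_compressor import QuantizationAwareTrainingConfig
    from neural_compressor.training import prepare_compression

    model = torch.nn.Sequential(
        torch.nn.Linear(16, 16), torch.nn.ReLU(), torch.nn.Linear(16, 2)
    )

    conf = QuantizationAwareTrainingConfig()
    compression_manager = prepare_compression(model, conf)
    compression_manager.callbacks.on_train_begin()
    model = compression_manager.model  # 2.x docs train this object directly

    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    for _ in range(3):  # stand-in training loop on random data
        x, y = torch.randn(8, 16), torch.randint(0, 2, (8,))
        loss = torch.nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    compression_manager.callbacks.on_train_end()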

Health Check

  • Last commit: 3 days ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 32
  • Issues (30d): 0
  • Star History: 85 stars in the last 90 days

Explore Similar Projects

Starred by Aravind Srinivas (Cofounder of Perplexity), Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake), and 12 more.

DeepSpeed by deepspeedai

  • Deep learning optimization library for distributed training and inference
  • Top 0.2% on sourcepulse; 40k stars; created 5 years ago; updated 1 day ago