neural-compressor by intel

Python library for model compression (quantization, pruning, distillation, NAS)

Created 5 years ago
2,492 stars

Top 18.7% on SourcePulse

Project Summary

Intel® Neural Compressor is an open-source Python library offering state-of-the-art model compression techniques like low-bit quantization (INT8, FP8, INT4, FP4, NF4) and sparsity. It targets researchers and engineers seeking to optimize deep learning models for inference on various hardware, particularly Intel platforms, by reducing model size and accelerating execution.

How It Works

The library supports quantization, pruning, distillation, and neural architecture search across TensorFlow, PyTorch, and ONNX Runtime. It employs accuracy-driven, automatic quantization strategies, including dynamic, static, SmoothQuant, and weight-only quantization, to minimize accuracy loss while maximizing performance gains. The recent 3.x API introduces a Transformers-like interface for INT4 inference.
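For concreteness, here is a minimal sketch of the accuracy-driven tuning loop. It assumes the 3.x PyTorch path exposes autotune, TuningConfig, and RTNConfig as in recent releases (names and signatures may differ across versions; the toy model and metric are placeholders, not the library's recommended setup):

    import torch
    from neural_compressor.torch.quantization import RTNConfig, TuningConfig, autotune

    # Toy stand-in for a real network; any torch.nn.Module works here.
    model = torch.nn.Sequential(
        torch.nn.Linear(64, 64), torch.nn.ReLU(), torch.nn.Linear(64, 8)
    )

    def eval_fn(q_model) -> float:
        # Placeholder metric: in practice, return validation accuracy so the
        # tuner can reject candidate configs that lose too much accuracy.
        with torch.no_grad():
            return float(q_model(torch.randn(4, 64)).sum())

    # Candidate set: round-to-nearest weight-only quantization at 4 and 8 bits.
    # The tuner evaluates each candidate with eval_fn and keeps the best one.
    tune_config = TuningConfig(config_set=[RTNConfig(bits=[4, 8])])
    best_model = autotune(model=model, tune_config=tune_config, eval_fn=eval_fn)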

Quick Start & Requirements

  • Installation: pip install neural-compressor[pt] (for PyTorch) or pip install neural-compressor[tf] (for TensorFlow); a minimal first-run sketch follows this list.
  • Prerequisites: Python 3.8+ and the matching framework packages (e.g., intel_extension_for_pytorch). Docker images are recommended for Intel Gaudi AI Accelerators.
  • Resources: Setup time varies; Gaudi examples require specific Docker images and environment setup.
  • Documentation: Official Documentation, LLM Recipes, Validated Models.
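As a first run after installing the PyTorch extra, a minimal sketch of direct weight-only quantization. It assumes the 3.x prepare/convert flow and an RTNConfig accepting bits and group_size; verify these names against the documentation for your installed release:

    # pip install neural-compressor[pt] torch
    import torch
    from neural_compressor.torch.quantization import RTNConfig, prepare, convert

    # Toy model; in practice this would be a pretrained network.
    model = torch.nn.Sequential(
        torch.nn.Linear(128, 128), torch.nn.ReLU(), torch.nn.Linear(128, 10)
    )

    quant_config = RTNConfig(bits=4, group_size=32)  # INT4, per-group scales
    model = prepare(model, quant_config)  # wrap modules for quantization
    model = convert(model)                # materialize low-bit weights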

Highlighted Details

  • Supports a wide range of Intel hardware (Gaudi, Core Ultra, Xeon, Data Center GPUs) and offers limited testing on AMD CPU, ARM CPU, and NVIDIA GPU via ONNX Runtime.
  • Validated with numerous LLMs (Llama2, Falcon, GPT-J) and a broad range of other models (Stable Diffusion, BERT-Large, ResNet50) from hubs such as Hugging Face and TorchVision.
  • Integrates with cloud marketplaces (GCP, AWS, Azure) and AI ecosystems (Hugging Face, PyTorch, ONNX Runtime, Microsoft Olive).
  • Features a Transformers-like API for INT4 inference on Intel CPUs and GPUs (see the sketch below).
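A sketch of that Transformers-like path, adapted loosely from the project's examples. The neural_compressor.transformers module and the load_in_4bit flag are assumptions to check against your installed version, and the model name is purely illustrative:

    from transformers import AutoTokenizer
    from neural_compressor.transformers import AutoModelForCausalLM

    model_name = "Qwen/Qwen2-7B"  # illustrative; any supported causal LM
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    inputs = tokenizer("Once upon a time,", return_tensors="pt").input_ids

    # load_in_4bit triggers INT4 weight-only quantization at load time.
    q_model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
    print(tokenizer.decode(q_model.generate(inputs, max_new_tokens=32)[0]))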

Maintenance & Community

  • Actively maintained with regular releases (v3.3.1 at the time of the README).
  • Community engagement via GitHub Issues, email (inc.maintainers@intel.com), and a Discord Channel.

Licensing & Compatibility

  • Licensed under Apache 2.0.
  • Compatible with commercial use and closed-source linking.

Limitations & Caveats

  • Training-time compression (quantization-aware training, pruning, distillation) is currently available only in the older 2.x API.
  • Testing on non-Intel hardware (AMD CPU, ARM CPU, NVIDIA GPU) is limited.

Health Check

  • Last Commit: 15 hours ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 27
  • Issues (30d): 1

Star History

  • 28 stars in the last 30 days

Explore Similar Projects

Starred by Lysandre Debut (Chief Open-Source Officer at Hugging Face), Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems"), and 4 more.

AQLM by Vahe1994

0.4% · 1k stars
PyTorch code for LLM compression via Additive Quantization (AQLM)
Created 1 year ago · Updated 1 month ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla and OpenAI; author of CS 231n), Jeff Hammerbacher (Cofounder of Cloudera), and 4 more.

gemma_pytorch by google

0.2% · 6k stars
PyTorch implementation for Google's Gemma models
Created 1 year ago · Updated 3 months ago