neural-compressor by intel

Python library for model compression (quantization, pruning, distillation, NAS)

Created 5 years ago
2,492 stars

Top 18.7% on SourcePulse

Project Summary

Intel® Neural Compressor is an open-source Python library offering state-of-the-art model compression techniques like low-bit quantization (INT8, FP8, INT4, FP4, NF4) and sparsity. It targets researchers and engineers seeking to optimize deep learning models for inference on various hardware, particularly Intel platforms, by reducing model size and accelerating execution.

How It Works

The library supports quantization, pruning, distillation, and neural architecture search across TensorFlow, PyTorch, and ONNX Runtime. It employs accuracy-driven, automatic quantization strategies, including dynamic, static, SmoothQuant, and weight-only quantization, to minimize accuracy loss while maximizing performance gains. The recent 3.x API introduces a Transformers-like interface for INT4 inference.
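For concreteness, here is a minimal sketch of the accuracy-driven tuning loop. It assumes the 3.x PyTorch path exposes autotune, TuningConfig, and RTNConfig as in recent releases (names and signatures may differ across versions; the toy model and metric are placeholders, not the library's recommended setup):

    import torch
    from neural_compressor.torch.quantization import RTNConfig, TuningConfig, autotune

    # Toy stand-in for a real network; any torch.nn.Module works here.
    model = torch.nn.Sequential(
        torch.nn.Linear(64, 64), torch.nn.ReLU(), torch.nn.Linear(64, 8)
    )

    def eval_fn(q_model) -> float:
        # Placeholder metric: in practice, return validation accuracy so the
        # tuner can reject candidate configs that lose too much accuracy.
        with torch.no_grad():
            return float(q_model(torch.randn(4, 64)).sum())

    # Candidate set: round-to-nearest weight-only quantization at 4 and 8 bits.
    # The tuner evaluates each candidate with eval_fn and keeps the best one.
    tune_config = TuningConfig(config_set=[RTNConfig(bits=[4, 8])])
    best_model = autotune(model=model, tune_config=tune_config, eval_fn=eval_fn)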

Quick Start & Requirements

  • Installation: pip install neural-compressor[pt] (for PyTorch) or pip install neural-compressor[tf] (for TensorFlow); a minimal first-run sketch follows this list.
  • Prerequisites: Python 3.8+ and the matching framework packages (e.g., intel_extension_for_pytorch). Docker images are recommended for Intel Gaudi AI Accelerators.
  • Resources: Setup time varies; Gaudi examples require specific Docker images and environment setup.
  • Documentation: Official Documentation, LLM Recipes, Validated Models.
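As a first run after installing the PyTorch extra, a minimal sketch of direct weight-only quantization. It assumes the 3.x prepare/convert flow and an RTNConfig accepting bits and group_size; verify these names against the documentation for your installed release:

    # pip install neural-compressor[pt] torch
    import torch
    from neural_compressor.torch.quantization import RTNConfig, prepare, convert

    # Toy model; in practice this would be a pretrained network.
    model = torch.nn.Sequential(
        torch.nn.Linear(128, 128), torch.nn.ReLU(), torch.nn.Linear(128, 10)
    )

    quant_config = RTNConfig(bits=4, group_size=32)  # INT4, per-group scales
    model = prepare(model, quant_config)  # wrap modules for quantization
    model = convert(model)                # materialize low-bit weights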

Highlighted Details

  • Supports a wide range of Intel hardware (Gaudi, Core Ultra, Xeon, Data Center GPUs) and offers limited testing on AMD CPU, ARM CPU, and NVIDIA GPU via ONNX Runtime.
  • Validated with numerous LLMs (Llama2, Falcon, GPT-J) and a broad range of other models (Stable Diffusion, BERT-Large, ResNet50) from hubs such as Hugging Face and TorchVision.
  • Integrates with cloud marketplaces (GCP, AWS, Azure) and AI ecosystems (Hugging Face, PyTorch, ONNX Runtime, Microsoft Olive).
  • Features a Transformers-like API for INT4 inference on Intel CPUs and GPUs (see the sketch below).
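A sketch of that Transformers-like path, adapted loosely from the project's examples. The neural_compressor.transformers module and the load_in_4bit flag are assumptions to check against your installed version, and the model name is purely illustrative:

    from transformers import AutoTokenizer
    from neural_compressor.transformers import AutoModelForCausalLM

    model_name = "Qwen/Qwen2-7B"  # illustrative; any supported causal LM
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    inputs = tokenizer("Once upon a time,", return_tensors="pt").input_ids

    # load_in_4bit triggers INT4 weight-only quantization at load time.
    q_model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
    print(tokenizer.decode(q_model.generate(inputs, max_new_tokens=32)[0]))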

Maintenance & Community

  • Actively maintained with regular releases (v3.3.1 at the time of the README).
  • Community engagement via GitHub Issues, email (inc.maintainers@intel.com), and a Discord Channel.

Licensing & Compatibility

  • Licensed under Apache 2.0.
  • Compatible with commercial use and closed-source linking.

Limitations & Caveats

  • Training-time compression (quantization-aware training, pruning, distillation) is currently available only in the older 2.x API.
  • Testing on non-Intel hardware (AMD CPU, ARM CPU, NVIDIA GPU) is limited.

Health Check

  • Last Commit: 15 hours ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 27
  • Issues (30d): 1

Star History

  • 28 stars in the last 30 days

Explore Similar Projects

Starred by Lysandre Debut (Chief Open-Source Officer at Hugging Face), Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems"), and 4 more.

AQLM by Vahe1994

0.4% · 1k stars
PyTorch code for LLM compression via Additive Quantization (AQLM)
Created 1 year ago · Updated 1 month ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla and OpenAI; author of CS 231n), Jeff Hammerbacher (Cofounder of Cloudera), and 4 more.

gemma_pytorch by google

0.2% · 6k stars
PyTorch implementation for Google's Gemma models
Created 1 year ago · Updated 3 months ago