neural-compressor by intel

Python library for model compression (quantization, pruning, distillation, NAS)

Created 5 years ago
2,566 stars

Top 18.1% on SourcePulse

Project Summary

Intel® Neural Compressor is an open-source Python library offering state-of-the-art model compression techniques like low-bit quantization (INT8, FP8, INT4, FP4, NF4) and sparsity. It targets researchers and engineers seeking to optimize deep learning models for inference on various hardware, particularly Intel platforms, by reducing model size and accelerating execution.

How It Works

The library supports quantization, pruning, distillation, and neural architecture search across TensorFlow, PyTorch, and ONNX Runtime. It employs accuracy-driven, automatic quantization strategies, including dynamic, static, SmoothQuant, and weight-only quantization, to minimize accuracy loss while maximizing performance gains. The recent 3.x API introduces a Transformers-like interface for INT4 inference.
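As a concrete illustration, a minimal INT4 run with the 3.x Transformers-like API might look like the sketch below. The model id, prompt, and generation settings are illustrative placeholders, and the load_in_4bit flag is assumed to follow the Hugging Face from_pretrained convention that this API mirrors:

    from transformers import AutoTokenizer
    from neural_compressor.transformers import AutoModelForCausalLM

    # Illustrative model id; any Hugging Face causal-LM checkpoint follows the same flow.
    model_name = "Intel/neural-chat-7b-v3-1"
    prompt = "Once upon a time, there existed a little girl,"

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    # load_in_4bit applies weight-only INT4 quantization at load time,
    # keeping the familiar from_pretrained/generate workflow.
    model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
    outputs = model.generate(input_ids, max_new_tokens=32)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))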

Quick Start & Requirements

  • Installation: pip install neural-compressor[pt] (for PyTorch) or pip install neural-compressor[tf] (for TensorFlow).
  • Prerequisites: Python 3.8+ plus the target framework (e.g., intel_extension_for_pytorch for Intel-optimized PyTorch). Docker images are recommended for Intel Gaudi AI Accelerators.
  • Resources: Setup time varies; Gaudi examples require specific Docker images and environment setup.
  • Documentation: Official Documentation, LLM Recipes, Validated Models.
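Once installed, a basic weight-only quantization pass takes only a few lines. The sketch below assumes the 3.x PyTorch API's prepare/convert flow with a round-to-nearest (RTN) config, which needs no calibration data; the toy model and the default config are illustrative:

    import torch
    from neural_compressor.torch.quantization import RTNConfig, prepare, convert

    # Toy float model standing in for a real network.
    model = torch.nn.Sequential(
        torch.nn.Linear(64, 64),
        torch.nn.ReLU(),
        torch.nn.Linear(64, 10),
    )

    # RTN is weight-only quantization, so no calibration dataset is required.
    quant_config = RTNConfig()  # library defaults; bit width etc. are configurable
    model = prepare(model, quant_config)
    model = convert(model)

    # The converted model is used for inference as usual.
    with torch.no_grad():
        out = model(torch.randn(1, 64))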

Highlighted Details

  • Supports a wide range of Intel hardware (Gaudi, Core Ultra, Xeon, Data Center GPUs) and offers limited testing on AMD CPU, ARM CPU, and NVIDIA GPU via ONNX Runtime.
  • Validated with numerous LLMs (Llama2, Falcon, GPT-J) and a broad range of other models (Stable Diffusion, BERT-Large, ResNet50) from hubs like Hugging Face and TorchVision.
  • Integrates with cloud marketplaces (GCP, AWS, Azure) and AI ecosystems (Hugging Face, PyTorch, ONNX Runtime, Microsoft Olive).
  • Features a Transformers-like API for INT4 inference on Intel CPUs and GPUs.

Maintenance & Community

  • Actively maintained with regular releases (v3.3.1 at the time of the README).
  • Community engagement via GitHub Issues, email (inc.maintainers@intel.com), and a Discord Channel.

Licensing & Compatibility

  • Licensed under Apache 2.0.
  • Compatible with commercial use and closed-source linking.

Limitations & Caveats

  • Compression techniques applied during training (QAT, pruning, distillation) are currently available only in the older 2.x API.
  • Testing on non-Intel hardware (AMD CPU, ARM CPU, NVIDIA GPU via ONNX Runtime) is limited.
Health Check

  • Last Commit: 11 hours ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 25
  • Issues (30d): 1
  • Star History: 20 stars in the last 30 days

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Dan Guido (cofounder of Trail of Bits), and 6 more.

Explore Similar Projects

llm-compressor by vllm-project

  • Transformers-compatible library for LLM compression, optimized for vLLM deployment
  • Top 1.6% on SourcePulse; ~3k stars
  • Created 1 year ago; updated 14 hours ago
  • Starred by Andrej Karpathy (founder of Eureka Labs; formerly at Tesla and OpenAI; author of CS 231n), Jeff Hammerbacher (cofounder of Cloudera), and 4 more

gemma_pytorch by google

  • PyTorch implementation of Google's Gemma models
  • Top 0.1% on SourcePulse; ~6k stars
  • Created 1 year ago; updated 7 months ago