Library for efficient LLM inference via low-bit quantization
Neural Speed is a library for efficient Large Language Model (LLM) inference on Intel platforms, leveraging low-bit quantization techniques. It targets developers and researchers seeking to optimize LLM performance on CPUs, offering significant speedups and advanced features like tensor parallelism.
How It Works
Neural Speed uses Intel Neural Compressor for state-of-the-art low-bit quantization (int1 through int8) and ships highly optimized kernels for Intel CPUs that exploit instruction set extensions such as AMX, VNNI, and AVX2. By cutting memory bandwidth and compute requirements, the project reports speedups of up to 40x over llama.cpp.
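To make this concrete, below is a minimal sketch using the Pythonic API shown in the project's README. The checkpoint name and the weight_dtype/compute_dtype values are illustrative, and exact signatures may differ across versions:

```python
from transformers import AutoTokenizer
from neural_speed import Model

# Illustrative checkpoint; any model family supported by Neural Speed works.
model_name = "Intel/neural-chat-7b-v3-1"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer("Once upon a time", return_tensors="pt").input_ids

model = Model()
# int4 weights with int8 compute exercise the low-bit kernels described above;
# on capable CPUs these map onto AMX/VNNI/AVX2 instructions.
model.init(model_name, weight_dtype="int4", compute_dtype="int8")
outputs = model.generate(inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```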
Quick Start & Requirements
Install the latest release from PyPI:
pip install neural-speed
Or build from source:
pip install -r requirements.txt
pip install neural-speed
When driving Neural Speed through an integrating framework, enable it by setting use_neural_speed: true in the configuration (see the sketch below).
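As a sketch of that configuration hook, assuming the common integration through intel-extension-for-transformers (the use_neural_speed keyword and load_in_4bit flag are assumptions about that package, not stated in this README):

```python
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "Intel/neural-chat-7b-v3-1"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer("Once upon a time", return_tensors="pt").input_ids

# Assumption: recent intel-extension-for-transformers releases accept a
# use_neural_speed flag; load_in_4bit requests weight-only 4-bit quantization.
model = AutoModelForCausalLM.from_pretrained(
    model_name, load_in_4bit=True, use_neural_speed=True
)
outputs = model.generate(inputs, max_new_tokens=32)
```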
Highlighted Details
- Weight-only quantization from int8 down to int1 via Intel Neural Compressor
- Kernels optimized for Intel ISA extensions (AMX, VNNI, AVX2)
- Reported speedups of up to 40x over llama.cpp
- Tensor parallelism for scaling inference across CPUs
Maintenance & Community
This project is NOT UNDER ACTIVE MANAGEMENT by Intel: development, maintenance, bug fixes, and contributions have ceased. Intel suggests intel/intel-extension-for-pytorch as an alternative.
Licensing & Compatibility
The README does not explicitly state the license. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
Because Intel no longer maintains the project, no further updates, bug fixes, or support will be provided. Per the README, the APIs are also subject to change.