Library for efficient LLM inference via low-bit quantization
Neural Speed is a library for efficient Large Language Model (LLM) inference on Intel platforms, leveraging low-bit quantization techniques. It targets developers and researchers seeking to optimize LLM performance on CPUs, offering significant speedups and advanced features like tensor parallelism.
How It Works
Neural Speed uses Intel Neural Compressor for state-of-the-art low-bit quantization (int1 through int8) and ships highly optimized kernels for Intel CPUs that exploit instruction set extensions such as AMX, VNNI, and AVX2. By cutting memory bandwidth and compute requirements, the project reports speedups of up to 40x over llama.cpp.
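To make this concrete, below is a minimal sketch using the Pythonic API shown in the project's README. The checkpoint name and the weight_dtype/compute_dtype values are illustrative, and exact signatures may differ across versions:

```python
from transformers import AutoTokenizer
from neural_speed import Model

# Illustrative checkpoint; any model family supported by Neural Speed works.
model_name = "Intel/neural-chat-7b-v3-1"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer("Once upon a time", return_tensors="pt").input_ids

model = Model()
# int4 weights with int8 compute exercise the low-bit kernels described above;
# on capable CPUs these map onto AMX/VNNI/AVX2 instructions.
model.init(model_name, weight_dtype="int4", compute_dtype="int8")
outputs = model.generate(inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```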
Quick Start & Requirements
Install the latest release from PyPI:
pip install neural-speed
Or build from source:
pip install -r requirements.txt
pip install neural-speed
When driving Neural Speed through an integrating framework, enable it by setting use_neural_speed: true in the configuration (see the sketch below).
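As a sketch of that configuration hook, assuming the common integration through intel-extension-for-transformers (the use_neural_speed keyword and load_in_4bit flag are assumptions about that package, not stated in this README):

```python
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "Intel/neural-chat-7b-v3-1"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer("Once upon a time", return_tensors="pt").input_ids

# Assumption: recent intel-extension-for-transformers releases accept a
# use_neural_speed flag; load_in_4bit requests weight-only 4-bit quantization.
model = AutoModelForCausalLM.from_pretrained(
    model_name, load_in_4bit=True, use_neural_speed=True
)
outputs = model.generate(inputs, max_new_tokens=32)
```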
Highlighted Details
- Weight-only quantization from int8 down to int1 via Intel Neural Compressor
- Kernels optimized for Intel ISA extensions (AMX, VNNI, AVX2)
- Reported speedups of up to 40x over llama.cpp
- Tensor parallelism for scaling inference across CPUs
Maintenance & Community
This project is NOT UNDER ACTIVE MANAGEMENT by Intel: development, maintenance, bug fixes, and contributions have ceased. Intel suggests intel/intel-extension-for-pytorch as an alternative.
Licensing & Compatibility
The README does not explicitly state the license. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
Because Intel no longer maintains the project, no further updates, bug fixes, or support will be provided. Per the README, the APIs are also subject to change.