neural-speed by intel

Library for efficient LLM inference via low-bit quantization

created 1 year ago
349 stars

Top 80.8% on sourcepulse

1 Expert Loves This Project
Project Summary

Neural Speed is a library for efficient Large Language Model (LLM) inference on Intel platforms, leveraging low-bit quantization techniques. It targets developers and researchers seeking to optimize LLM performance on CPUs, offering significant speedups and advanced features like tensor parallelism.

How It Works

Neural Speed uses Intel Neural Compressor for state-of-the-art low-bit quantization (int1 through int8) and ships highly optimized kernels for Intel CPUs, covering instruction set extensions such as AMX, VNNI, and AVX2. By shrinking weights to a few bits per value, it cuts memory-bandwidth and compute requirements, which the project credits for speedups of up to 40x over llama.cpp.
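
To make the idea concrete, here is a small, self-contained sketch of group-wise symmetric int4 weight quantization, the general technique behind low-bit weight-only inference. It is illustrative only: the group size of 32 and the symmetric scheme are assumptions, and it does not reproduce Neural Speed's actual kernels.

```python
# Illustrative group-wise symmetric int4 weight quantization.
# This sketches the general technique, NOT Neural Speed's kernels:
# weights are split into fixed-size groups, each stored as 4-bit
# integers plus one fp32 scale, and dequantized on the fly at matmul time.
import numpy as np

def quantize_int4(w: np.ndarray, group_size: int = 32):
    """One fp32 scale per group of `group_size` weights; int4 range is [-8, 7]."""
    groups = w.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize_int4(q: np.ndarray, scales: np.ndarray, shape) -> np.ndarray:
    return (q.astype(np.float32) * scales).reshape(shape)

w = np.random.randn(4096, 64).astype(np.float32)
q, scales = quantize_int4(w.ravel())
w_hat = dequantize_int4(q, scales, w.shape)
# 4 bits per weight (plus scales) instead of 32: roughly 8x smaller,
# with a small per-weight reconstruction error.
print("max abs error:", np.abs(w - w_hat).max())
```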

Quick Start & Requirements

  • Installation: pip install neural-speed (from PyPI), or build from source with pip install -r requirements.txt followed by pip install .
  • Prerequisites: GCC 10 or newer.
  • Usage: supports Hugging Face PyTorch models (e.g., Llama 2, Mistral) and GGUF-format models; integration with Intel Extension for Transformers is enabled via use_neural_speed: true in its configuration. See the sketches below.
  • Hardware: Intel Xeon Scalable Processors, Intel Xeon CPU Max Series, or Intel Core Processors.
  • Docs: Neural Chat
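
A minimal quick-start sketch in the spirit of the project's documented Python flow follows. The model name is illustrative, and since the project is archived and its APIs were noted as subject to change, the exact Model.init parameters (weight_dtype, compute_dtype) should be verified against the repository.

```python
# Sketch of Neural Speed's documented Python flow; assumes the archived
# repository's Model API (Model.init with weight_dtype/compute_dtype)
# is still current -- verify against the repo before relying on it.
from transformers import AutoTokenizer, TextStreamer
from neural_speed import Model

model_name = "meta-llama/Llama-2-7b-hf"  # illustrative; any supported HF model
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

model = Model()
# Quantize weights to int4 at load time; compute in int8 where supported.
model.init(model_name, weight_dtype="int4", compute_dtype="int8")
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
```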

Highlighted Details

  • Up to 40x performance speedup on popular LLMs compared to llama.cpp.
  • Supports N-bit weight quantization (int1-int8).
  • Tensor parallelism across CPU sockets/nodes.
  • Compatible with Hugging Face and ModelScope models (ITREX integration sketch below).
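
For the Intel Extension for Transformers (ITREX) route, a hedged sketch follows: the load_in_4bit flag mirrors the upstream documentation as remembered here, but the model choice is illustrative and the flag should be checked against the ITREX docs.

```python
# Sketch of the Intel Extension for Transformers (ITREX) integration path,
# where from_pretrained(load_in_4bit=True) routes CPU inference through
# Neural Speed's low-bit kernels. The model name is illustrative.
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "Intel/neural-chat-7b-v3-1"  # illustrative
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer("Once upon a time, there existed a little girl,",
                   return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
```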

Maintenance & Community

This project is no longer under active management by Intel: development, maintenance, bug fixes, and contributions have ceased. Intel suggests intel/intel-extension-for-pytorch as an alternative.

Licensing & Compatibility

The README does not explicitly state the license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

With Intel no longer maintaining the project, no future updates, bug fixes, or support should be expected, and the README notes that its APIs are subject to change.

Health Check

  • Last commit: 11 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 1 star in the last 90 days

Explore Similar Projects

nunchaku by nunchaku-tech
High-performance 4-bit diffusion model inference engine. 3k stars (top 2.1%); created 8 months ago, updated 14 hours ago. Starred by Chip Huyen (author of AI Engineering, Designing Machine Learning Systems), Jaret Burkett (founder of Ostris), and 1 more.

FasterTransformer by NVIDIA
Optimized transformer library for inference. 6k stars (top 0.2%); created 4 years ago, updated 1 year ago. Starred by Nat Friedman (former CEO of GitHub), Chip Huyen (author of AI Engineering, Designing Machine Learning Systems), and 6 more.