neural-speed by intel

Library for efficient LLM inference via low-bit quantization

Created 1 year ago
348 stars

Top 79.8% on SourcePulse

View on GitHub
Project Summary

Neural Speed is a library for efficient Large Language Model (LLM) inference on Intel platforms, leveraging low-bit quantization techniques. It targets developers and researchers seeking to optimize LLM performance on CPUs, offering significant speedups and advanced features like tensor parallelism.

How It Works

Neural Speed uses Intel Neural Compressor for state-of-the-art low-bit quantization (int1 through int8) and ships highly optimized kernels for Intel CPUs that exploit instruction set extensions such as AMX, VNNI, and AVX2. Shrinking weights to a few bits cuts memory-bandwidth and compute requirements, which the project credits for speedups of up to 40x over llama.cpp.
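
As a concrete illustration, below is a minimal sketch of the library's Pythonic API, following the style of the repository's examples: weight_dtype selects the low-bit weight format and compute_dtype the kernel precision. Exact argument names may vary between releases, so treat this as a sketch rather than a definitive recipe.

    from transformers import AutoTokenizer, TextStreamer
    from neural_speed import Model

    model_name = "meta-llama/Llama-2-7b-hf"  # Hugging Face model id or local path
    prompt = "Once upon a time, there existed a little girl,"

    # Tokenize the prompt with the stock Hugging Face tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    inputs = tokenizer(prompt, return_tensors="pt").input_ids
    streamer = TextStreamer(tokenizer)

    # Quantize on load: 4-bit weights with 8-bit compute kernels
    model = Model()
    model.init(model_name, weight_dtype="int4", compute_dtype="int8")
    outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)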

Quick Start & Requirements

  • Installation: pip install neural-speed (prebuilt from PyPI); or build from source with pip install -r requirements.txt followed by pip install . in the repo root.
  • Prerequisites: GCC 10+.
  • Usage: Supports Hugging Face PyTorch models (e.g., Llama2, Mistral) and GGUF format models. Integration with Intel Extension for Transformers is available via use_neural_speed: true in configuration (see the Transformers-style sketch after this list).
  • Hardware: Intel Xeon Scalable Processors, Intel Xeon CPU Max Series, or Intel Core Processors.
  • Docs: Neural Chat
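
A minimal sketch of the Transformers-style path through Intel Extension for Transformers, modeled on the project's quick-start examples: load_in_4bit requests weight-only 4-bit quantization, and the use_neural_speed setting mentioned above is what routes inference through this library (behavior may differ across ITREX versions).

    from transformers import AutoTokenizer, TextStreamer
    from intel_extension_for_transformers.transformers import AutoModelForCausalLM

    model_name = "Intel/neural-chat-7b-v3-1"  # Hugging Face model id or local path
    prompt = "Once upon a time, there existed a little girl,"

    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    inputs = tokenizer(prompt, return_tensors="pt").input_ids
    streamer = TextStreamer(tokenizer)

    # load_in_4bit asks ITREX to quantize the weights to 4-bit and run them
    # through Neural Speed's optimized CPU kernels
    model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
    outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)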

Highlighted Details

  • Up to 40x performance speedup on popular LLMs compared to llama.cpp.
  • Supports N-bit weight quantization (int1-int8).
  • Tensor parallelism across CPU sockets/nodes.
  • Compatible with Hugging Face and Modelscope models, including pre-quantized GGUF files (see the GGUF sketch below).
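
For pre-quantized GGUF checkpoints, the project's examples pass the GGUF file name alongside the hub repo, while the tokenizer comes from the original (non-GGUF) model. A hedged sketch; the model_file argument and the specific repo and file names below are illustrative and should be checked against the installed version.

    from transformers import AutoTokenizer, TextStreamer
    from intel_extension_for_transformers.transformers import AutoModelForCausalLM

    # GGUF repo and file on the Hugging Face Hub; tokenizer from the original model
    model_name = "TheBloke/Llama-2-7B-Chat-GGUF"
    model_file = "llama-2-7b-chat.Q4_0.gguf"
    tokenizer_name = "meta-llama/Llama-2-7b-chat-hf"

    prompt = "Once upon a time, there existed a little girl,"
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, trust_remote_code=True)
    inputs = tokenizer(prompt, return_tensors="pt").input_ids
    streamer = TextStreamer(tokenizer)

    # Load the pre-quantized GGUF weights directly, no re-quantization needed
    model = AutoModelForCausalLM.from_pretrained(model_name, model_file=model_file)
    outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)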

Maintenance & Community

This project is NOT UNDER ACTIVE MANAGEMENT by Intel. Intel has ceased development, maintenance, bug fixes, and contributions. An alternative is intel/intel-extension-for-pytorch.

Licensing & Compatibility

The README does not explicitly state the license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project is no longer maintained by Intel, so no future updates, bug fixes, or support will be provided. The README also warns that the APIs are subject to change.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days

Explore Similar Projects

Starred by Wing Lian (Founder of Axolotl AI) and Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems").

airllm by lyogavin
Top 0.1% · 6k stars
Inference optimization for LLMs on low-resource hardware
Created 2 years ago · Updated 2 weeks ago