xFasterTransformer by Intel

Optimized solution for LLM inference on X86 platforms

created 2 years ago
429 stars

Top 70.2% on sourcepulse

Project Summary

xFasterTransformer is an optimized inference solution for large language models (LLMs) on Intel X86 CPUs, offering high performance and scalability for single and multi-socket/node deployments. It provides C++ and Python APIs, supports numerous LLM architectures, and integrates with popular serving frameworks like vLLM and FastChat.
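
As a concrete illustration, a minimal generation loop with the Python API looks roughly like this (a sketch following the project's published examples; model paths and generation parameters are placeholders):

```python
from transformers import AutoTokenizer
import xfastertransformer

TOKEN_PATH = "/data/llama-2-7b-hf"   # original Hugging Face checkpoint (for the tokenizer)
MODEL_PATH = "/data/llama-2-7b-xft"  # converted xFasterTransformer model (placeholder)

tokenizer = AutoTokenizer.from_pretrained(TOKEN_PATH)
model = xfastertransformer.AutoModel.from_pretrained(MODEL_PATH, dtype="bf16")

input_ids = tokenizer("What is AMX?", return_tensors="pt").input_ids
generated_ids = model.generate(input_ids, max_length=128)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])
```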

How It Works

xFasterTransformer leverages Intel X86 capabilities, including the AMX and AVX512 instruction sets, to accelerate LLM inference. It supports a range of low-precision data types (FP16, BF16, INT8, W8A8, INT4, NF4) for efficient memory usage and computation. The library is designed for distributed inference across multiple sockets and nodes, using oneCCL for communication.
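
For multi-socket or multi-node runs, the project's examples use a master/worker pattern: rank 0 prepares inputs and drives generation while the other ranks join each generate call. A hedged sketch of that pattern (the mpirun launch line and API shape follow the project's examples; exact flags may vary by version):

```python
# Launch under MPI, e.g.: mpirun -n 2 python demo.py  (illustrative)
from transformers import AutoTokenizer
import xfastertransformer

model = xfastertransformer.AutoModel.from_pretrained("/data/llama-2-7b-xft", dtype="bf16")

if model.rank == 0:
    # Master rank: tokenize, generate, and decode the result.
    tokenizer = AutoTokenizer.from_pretrained("/data/llama-2-7b-hf")
    input_ids = tokenizer("Hello", return_tensors="pt").input_ids
    output = model.generate(input_ids, max_length=64)
    print(tokenizer.batch_decode(output, skip_special_tokens=True)[0])
else:
    # Worker ranks: participate in every generate call issued by rank 0.
    while True:
        model.generate()
```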

Quick Start & Requirements

  • Installation: pip install xfastertransformer or via Docker (docker pull intel/xfastertransformer:latest).
  • Prerequisites: PyTorch v2.3 (for Python API), libnuma-dev (Ubuntu) or libnuma-devel (CentOS). Requires Intel CPUs with AMX and AVX512 instruction sets; not compatible with Intel Core CPUs. Linux is recommended.
  • Setup: Building from source uses CMake and Python setup.py. Models must be converted from Hugging Face format before use (see the conversion sketch after this list).
  • Docs: xFasterTransformer Documents and Wiki
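
The Hugging Face-to-xFT conversion step mentioned above can be done from Python; a sketch using the documented converter classes (directories are placeholders, and each model family has its own *Convert class):

```python
import xfastertransformer as xft

# Convert a Hugging Face Llama checkpoint into xFasterTransformer's on-disk format.
# Other model families use analogous converters (e.g. QwenConvert, ChatGLMConvert).
xft.LlamaConvert().convert("/data/llama-2-7b-hf", "/data/llama-2-7b-xft")
```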

Highlighted Details

  • Supports a wide range of LLMs including Llama, Qwen, ChatGLM, DeepSeek, Mixtral, and Gemma.
  • Offers extensive data type support: FP16, BF16, INT8, W8A8, INT4, NF4, and mixed precision formats.
  • Provides an OpenAI-compatible API server via a vLLM fork (vllm-xft) and integrates with FastChat (see the serving example after this list).
  • Includes benchmark scripts and example code for various use cases.
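
As a sketch of the serving path referenced above: start the vllm-xft OpenAI-compatible server, then query it with the standard openai client. The launch flags follow upstream vLLM conventions and are illustrative; consult the vllm-xft docs for the exact invocation:

```python
# Server side (shell), flags illustrative:
#   python -m vllm.entrypoints.openai.api_server \
#       --model /data/llama-2-7b-xft --served-model-name llama-2-7b --dtype bf16

from openai import OpenAI

# Point the client at the local server; the API key is unused but required.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="llama-2-7b",
    messages=[{"role": "user", "content": "Summarize xFasterTransformer in one line."}],
)
print(resp.choices[0].message.content)
```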

Maintenance & Community

  • Maintained by Intel.
  • Contact: xft.maintainer@intel.com
  • Papers accepted at ICLR 2024 and ICML 2024 describe its performance optimizations.

Licensing & Compatibility

  • License: Apache 2.0.
  • Compatible with commercial and closed-source applications.

Limitations & Caveats

  • Does not support Intel Core CPUs, which lack the required AMX/AVX512 instructions (a quick CPU-flag check is sketched after this list).
  • Native Windows support is not provided; Linux is recommended.
  • Downgrading oneAPI to v2023.x or below is advised if issues arise with the latest oneCCL.
  • Model conversion is necessary, and specific transformers library versions might be required for compatibility.
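
Given the hardware constraint above, it is worth verifying that the target CPU actually advertises the needed instruction sets before installing. A minimal, Linux-only check against /proc/cpuinfo (flag names as the Linux kernel reports them):

```python
def check_cpu_flags(required=("avx512f", "amx_tile", "amx_bf16")):
    """Return which of the required instruction-set flags the CPU advertises (Linux only)."""
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                flags = set(line.split(":", 1)[1].split())
                return {name: name in flags for name in required}
    return {}

if __name__ == "__main__":
    for name, present in check_cpu_flags().items():
        print(f"{name}: {'present' if present else 'MISSING'}")
```
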
Health Check

  • Last commit: 1 week ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 8
  • Issues (30d): 0
  • Star History: 10 stars in the last 90 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Nat Friedman (former CEO of GitHub), and 32 more.

llama.cpp by ggml-org

C/C++ library for local LLM inference

Top 0.4% on sourcepulse; 84k stars; created 2 years ago; updated 15 hours ago