xFasterTransformer by Intel

Optimized solution for LLM inference on X86 platforms

created 2 years ago
429 stars

Top 70.2% on sourcepulse

Project Summary

xFasterTransformer is an optimized inference solution for large language models (LLMs) on Intel X86 CPUs, offering high performance and scalability for single and multi-socket/node deployments. It provides C++ and Python APIs, supports numerous LLM architectures, and integrates with popular serving frameworks like vLLM and FastChat.
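
As a concrete illustration, a minimal generation loop with the Python API looks roughly like this (a sketch following the project's published examples; model paths and generation parameters are placeholders):

```python
from transformers import AutoTokenizer
import xfastertransformer

TOKEN_PATH = "/data/llama-2-7b-hf"   # original Hugging Face checkpoint (for the tokenizer)
MODEL_PATH = "/data/llama-2-7b-xft"  # converted xFasterTransformer model (placeholder)

tokenizer = AutoTokenizer.from_pretrained(TOKEN_PATH)
model = xfastertransformer.AutoModel.from_pretrained(MODEL_PATH, dtype="bf16")

input_ids = tokenizer("What is AMX?", return_tensors="pt").input_ids
generated_ids = model.generate(input_ids, max_length=128)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])
```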

How It Works

xFasterTransformer leverages Intel X86 capabilities, including the AMX and AVX512 instruction sets, to accelerate LLM inference. It supports a range of low-precision data types (FP16, BF16, INT8, W8A8, INT4, NF4) for efficient memory usage and computation. The library is designed for distributed inference across multiple sockets and nodes, using oneCCL for communication.
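
For multi-socket or multi-node runs, the project's examples use a master/worker pattern: rank 0 prepares inputs and drives generation while the other ranks join each generate call. A hedged sketch of that pattern (the mpirun launch line and API shape follow the project's examples; exact flags may vary by version):

```python
# Launch under MPI, e.g.: mpirun -n 2 python demo.py  (illustrative)
from transformers import AutoTokenizer
import xfastertransformer

model = xfastertransformer.AutoModel.from_pretrained("/data/llama-2-7b-xft", dtype="bf16")

if model.rank == 0:
    # Master rank: tokenize, generate, and decode the result.
    tokenizer = AutoTokenizer.from_pretrained("/data/llama-2-7b-hf")
    input_ids = tokenizer("Hello", return_tensors="pt").input_ids
    output = model.generate(input_ids, max_length=64)
    print(tokenizer.batch_decode(output, skip_special_tokens=True)[0])
else:
    # Worker ranks: participate in every generate call issued by rank 0.
    while True:
        model.generate()
```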

Quick Start & Requirements

  • Installation: pip install xfastertransformer or via Docker (docker pull intel/xfastertransformer:latest).
  • Prerequisites: PyTorch v2.3 (for Python API), libnuma-dev (Ubuntu) or libnuma-devel (CentOS). Requires Intel CPUs with AMX and AVX512 instruction sets; not compatible with Intel Core CPUs. Linux is recommended.
  • Setup: Building from source uses CMake and Python setup.py. Models must be converted from Hugging Face format before use (see the conversion sketch after this list).
  • Docs: xFasterTransformer Documents and Wiki
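
The Hugging Face-to-xFT conversion step mentioned above can be done from Python; a sketch using the documented converter classes (directories are placeholders, and each model family has its own *Convert class):

```python
import xfastertransformer as xft

# Convert a Hugging Face Llama checkpoint into xFasterTransformer's on-disk format.
# Other model families use analogous converters (e.g. QwenConvert, ChatGLMConvert).
xft.LlamaConvert().convert("/data/llama-2-7b-hf", "/data/llama-2-7b-xft")
```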

Highlighted Details

  • Supports a wide range of LLMs including Llama, Qwen, ChatGLM, DeepSeek, Mixtral, and Gemma.
  • Offers extensive data type support: FP16, BF16, INT8, W8A8, INT4, NF4, and mixed precision formats.
  • Provides an OpenAI-compatible API server via a vLLM fork (vllm-xft) and integrates with FastChat (see the serving example after this list).
  • Includes benchmark scripts and example code for various use cases.
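
As a sketch of the serving path referenced above: start the vllm-xft OpenAI-compatible server, then query it with the standard openai client. The launch flags follow upstream vLLM conventions and are illustrative; consult the vllm-xft docs for the exact invocation:

```python
# Server side (shell), flags illustrative:
#   python -m vllm.entrypoints.openai.api_server \
#       --model /data/llama-2-7b-xft --served-model-name llama-2-7b --dtype bf16

from openai import OpenAI

# Point the client at the local server; the API key is unused but required.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="llama-2-7b",
    messages=[{"role": "user", "content": "Summarize xFasterTransformer in one line."}],
)
print(resp.choices[0].message.content)
```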

Maintenance & Community

  • Maintained by Intel.
  • Contact: xft.maintainer@intel.com
  • Papers accepted at ICLR 2024 and ICML 2024 describe its performance optimizations.

Licensing & Compatibility

  • License: Apache 2.0.
  • Compatible with commercial and closed-source applications.

Limitations & Caveats

  • Does not support Intel Core CPUs, which lack the required AMX/AVX512 instructions (a quick CPU-flag check is sketched after this list).
  • Native Windows support is not provided; Linux is recommended.
  • Downgrading oneAPI to v2023.x or below is advised if issues arise with the latest oneCCL.
  • Model conversion is necessary, and specific transformers library versions might be required for compatibility.
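
Given the hardware constraint above, it is worth verifying that the target CPU actually advertises the needed instruction sets before installing. A minimal, Linux-only check against /proc/cpuinfo (flag names as the Linux kernel reports them):

```python
def check_cpu_flags(required=("avx512f", "amx_tile", "amx_bf16")):
    """Return which of the required instruction-set flags the CPU advertises (Linux only)."""
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                flags = set(line.split(":", 1)[1].split())
                return {name: name in flags for name in required}
    return {}

if __name__ == "__main__":
    for name, present in check_cpu_flags().items():
        print(f"{name}: {'present' if present else 'MISSING'}")
```
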
Health Check

  • Last commit: 1 week ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 8
  • Issues (30d): 0
  • Star History: 10 stars in the last 90 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Nat Friedman (former CEO of GitHub), and 32 more.

llama.cpp by ggml-org

C/C++ library for local LLM inference

Top 0.4% on sourcepulse; 84k stars; created 2 years ago; updated 15 hours ago