Optimized solution for LLM inference on X86 platforms
xFasterTransformer is an optimized inference solution for large language models (LLMs) on Intel X86 CPUs, offering high performance and scalability for single and multi-socket/node deployments. It provides C++ and Python APIs, supports numerous LLM architectures, and integrates with popular serving frameworks like vLLM and FastChat.
How It Works
xFasterTransformer leverages Intel's X86 architecture capabilities, including AMX and AVX512 instruction sets, to accelerate LLM inference. It supports various quantization formats (FP16, BF16, INT8, W8A8, INT4, NF4) for efficient memory usage and computation. The library is designed for distributed inference across multiple sockets and nodes, utilizing oneCCL for communication.
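To make the dtype options concrete, here is a minimal single-process inference sketch adapted from the project's published Python examples; the model and tokenizer paths are placeholders, and exact signatures may differ between releases.

```python
# Minimal sketch of single-process inference with xFasterTransformer.
# Paths are hypothetical placeholders; adapted from the project's examples.
import xfastertransformer
from transformers import AutoTokenizer

MODEL_PATH = "/data/llama-2-7b-xft"  # hypothetical converted xFT model directory
TOKEN_PATH = "/data/llama-2-7b-hf"   # hypothetical Hugging Face tokenizer directory

tokenizer = AutoTokenizer.from_pretrained(TOKEN_PATH, trust_remote_code=True)
input_ids = tokenizer("Once upon a time,", return_tensors="pt").input_ids

# dtype selects the compute/quantization format, e.g. "fp16", "bf16", "int8".
model = xfastertransformer.AutoModel.from_pretrained(MODEL_PATH, dtype="bf16")
generated = model.generate(input_ids, max_length=128)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```

For multi-socket or multi-node deployments, the project's examples typically launch one process per rank via MPI (e.g. mpirun), with oneCCL handling inter-process communication.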
Quick Start & Requirements
Install via pip (pip install xfastertransformer) or via Docker (docker pull intel/xfastertransformer:latest).
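Checkpoints are typically converted from Hugging Face format into xFasterTransformer's own format before loading. A hedged sketch, assuming a Llama checkpoint and that the converter class name matches the current release:

```python
# Hedged sketch: convert a Hugging Face checkpoint into the format
# xFasterTransformer loads. Paths are placeholders; the converter class
# name (LlamaConvert) is taken from the project's examples and may
# differ for other architectures or releases.
import xfastertransformer as xft

xft.LlamaConvert().convert(
    "/data/llama-2-7b-hf",   # hypothetical Hugging Face checkpoint directory
    "/data/llama-2-7b-xft",  # hypothetical output directory for converted weights
)
```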
Highlighted Details
Serving support through a dedicated vLLM fork (vllm-xft) and integration with FastChat.
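As one way to exercise the vllm-xft serving path, a hedged sketch of querying the standard OpenAI-compatible completions endpoint once a server is running; the address and served model name are placeholders.

```python
# Hedged sketch: query a running vllm-xft server through the standard
# OpenAI-compatible completions endpoint. URL and model name are placeholders.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",  # assumed local server address
    json={
        "model": "xft",                # placeholder served-model name
        "prompt": "Once upon a time,",
        "max_tokens": 64,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["text"])
```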
Maintenance & Community
Maintainer contact: xft.maintainer@intel.com
Licensing & Compatibility
Released under the Apache-2.0 license.
Limitations & Caveats
Specific versions of the transformers library may be required for compatibility.