xllm by jd-opensource

LLM inference engine optimized for diverse AI accelerators

Created 3 months ago
750 stars

Top 46.3% on SourcePulse

Project Summary

xLLM: High-Performance LLM Inference Engine for Diverse AI Accelerators

xLLM is an efficient inference framework for Large Language Models (LLMs), optimized specifically for Chinese AI accelerators. It targets enterprises that want to deploy LLMs at higher efficiency and lower cost, and its performance gains come from a service-engine decoupled architecture and the scheduling and engine-level optimizations described below.

How It Works

The framework employs a service-engine decoupled architecture. At the service layer, it utilizes elastic scheduling and dynamic request handling. The engine layer incorporates multi-stream parallel computing, graph fusion optimization, speculative inference, dynamic load balancing, and global KV cache management. This combination accelerates inference by overlapping computation and communication, optimizing memory usage, and adapting dynamically to model shapes and workloads, particularly on supported hardware.

Quick Start & Requirements

Installation is primarily via Docker: users can pull a pre-built image (e.g., xllm/xllm-ai:xllm-0.6.0-dev-hb-rc2-py3.11-oe24.03-lts) and run a container with the necessary device passthrough (--device=/dev/davinci0, etc.) and volume mounts. Alternatively, the project can be built from source by cloning the repository, initializing submodules, installing dependencies via pip, and compiling with setup.py. Key requirements include Ascend AI accelerators and, for building, vcpkg. Official documentation is available at https://xllm.readthedocs.io/zh-cn/latest/ and Docker images at https://hub.docker.com/r/xllm/xllm-ai. A rough sketch of both routes follows.
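The sketch below only illustrates the two routes described above. The image tag and the /dev/davinci0 device flag are quoted from this summary; the container name, additional device and volume mounts, the repository URL, the requirements file name, and the exact setup.py invocation are assumptions, so consult the official documentation for the authoritative commands.

    # Docker route: pull the pre-built image and start a container with
    # Ascend device passthrough and a model volume mount (paths are placeholders).
    docker pull xllm/xllm-ai:xllm-0.6.0-dev-hb-rc2-py3.11-oe24.03-lts
    docker run -it --name xllm-dev \
        --device=/dev/davinci0 \
        -v /path/to/models:/models \
        xllm/xllm-ai:xllm-0.6.0-dev-hb-rc2-py3.11-oe24.03-lts \
        /bin/bash

    # Source-build outline (repository URL and file names are assumptions):
    git clone https://github.com/jd-opensource/xllm.git
    cd xllm
    git submodule update --init --recursive
    pip install -r requirements.txt      # dependency file name is an assumption
    python setup.py build                # exact build target may differ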

Highlighted Details

  • Optimized Inference: Features graph pipeline execution, dynamic shape optimization, efficient memory management, and global KV cache management.
  • Algorithm Acceleration: Leverages speculative decoding and dynamic MoE expert load balancing for improved efficiency.
  • Broad Model Support: Compatible with models like DeepSeek-V3/R1, Qwen2/3, Kimi-k2, and Llama2/3.
  • Production Proven: Deployed in JD.com's core retail business across various applications.

Maintenance & Community

The project actively encourages contributions through issue reporting and pull requests. Community support is available via internal Slack channels and a WeChat user group. Several university research labs and numerous developers are acknowledged contributors.

Licensing & Compatibility

xLLM is licensed under the Apache License 2.0, which permits commercial use and modification.

Limitations & Caveats

The framework is heavily optimized for specific Chinese AI accelerators (e.g., Ascend), potentially limiting performance or compatibility on other hardware. The provided Docker image tags suggest the project may be in a development or release candidate stage. Setup requires specific hardware configurations and potentially complex Docker environment management.

Health Check

  • Last commit: 2 days ago
  • Responsiveness: Inactive
  • Pull requests (30d): 135
  • Issues (30d): 16
  • Star history: 134 stars in the last 30 days

Explore Similar Projects

Starred by Stas Bekman (author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), Luis Capelo (cofounder of Lightning AI), and 1 more.

  • ArcticInference by snowflakedb (top 3.1%, 325 stars): vLLM plugin for high-throughput, low-latency LLM and embedding inference. Created 8 months ago, updated 4 days ago.