xllm by jd-opensource

LLM inference engine optimized for diverse AI accelerators

Created 3 months ago
750 stars

Top 46.3% on SourcePulse

Project Summary

xLLM: High-Performance LLM Inference Engine for Diverse AI Accelerators

xLLM is an efficient inference framework for Large Language Models (LLMs), optimized specifically for Chinese AI accelerators. It targets enterprises that want to deploy LLMs at higher efficiency and lower cost, and its performance gains come from a service-engine decoupled architecture and the scheduling and engine-level optimizations described below.

How It Works

The framework employs a service-engine decoupled architecture. At the service layer, it utilizes elastic scheduling and dynamic request handling. The engine layer incorporates multi-stream parallel computing, graph fusion optimization, speculative inference, dynamic load balancing, and global KV cache management. This combination accelerates inference by overlapping computation and communication, optimizing memory usage, and adapting dynamically to model shapes and workloads, particularly on supported hardware.

Quick Start & Requirements

Installation is primarily via Docker: users can pull a pre-built image (e.g., xllm/xllm-ai:xllm-0.6.0-dev-hb-rc2-py3.11-oe24.03-lts) and run a container with the necessary device passthrough (--device=/dev/davinci0, etc.) and volume mounts. Alternatively, the project can be built from source by cloning the repository, initializing submodules, installing dependencies via pip, and compiling with setup.py. Key requirements include Ascend AI accelerators and, for building, vcpkg. Official documentation is available at https://xllm.readthedocs.io/zh-cn/latest/ and Docker images at https://hub.docker.com/r/xllm/xllm-ai. A rough sketch of both routes follows.
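The sketch below only illustrates the two routes described above. The image tag and the /dev/davinci0 device flag are quoted from this summary; the container name, additional device and volume mounts, the repository URL, the requirements file name, and the exact setup.py invocation are assumptions, so consult the official documentation for the authoritative commands.

    # Docker route: pull the pre-built image and start a container with
    # Ascend device passthrough and a model volume mount (paths are placeholders).
    docker pull xllm/xllm-ai:xllm-0.6.0-dev-hb-rc2-py3.11-oe24.03-lts
    docker run -it --name xllm-dev \
        --device=/dev/davinci0 \
        -v /path/to/models:/models \
        xllm/xllm-ai:xllm-0.6.0-dev-hb-rc2-py3.11-oe24.03-lts \
        /bin/bash

    # Source-build outline (repository URL and file names are assumptions):
    git clone https://github.com/jd-opensource/xllm.git
    cd xllm
    git submodule update --init --recursive
    pip install -r requirements.txt      # dependency file name is an assumption
    python setup.py build                # exact build target may differ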

Highlighted Details

  • Optimized Inference: Features graph pipeline execution, dynamic shape optimization, efficient memory management, and global KV cache management.
  • Algorithm Acceleration: Leverages speculative decoding and dynamic MoE expert load balancing for improved efficiency.
  • Broad Model Support: Compatible with models like DeepSeek-V3/R1, Qwen2/3, Kimi-k2, and Llama2/3.
  • Production Proven: Deployed in JD.com's core retail business across various applications.

Maintenance & Community

The project actively encourages contributions through issue reporting and pull requests. Community support is available via internal Slack channels and a WeChat user group. Several university research labs and numerous developers are acknowledged contributors.

Licensing & Compatibility

xLLM is licensed under the Apache License 2.0, which permits commercial use and modification.

Limitations & Caveats

The framework is heavily optimized for specific Chinese AI accelerators (e.g., Ascend), potentially limiting performance or compatibility on other hardware. The provided Docker image tags suggest the project may be in a development or release candidate stage. Setup requires specific hardware configurations and potentially complex Docker environment management.

Health Check

  • Last commit: 2 days ago
  • Responsiveness: Inactive
  • Pull requests (30d): 135
  • Issues (30d): 16
  • Star history: 134 stars in the last 30 days

Explore Similar Projects

Starred by Stas Bekman (author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), Luis Capelo (cofounder of Lightning AI), and 1 more.

  • ArcticInference by snowflakedb (top 3.1%, 325 stars): vLLM plugin for high-throughput, low-latency LLM and embedding inference. Created 8 months ago, updated 4 days ago.