xllm by jd-opensource

LLM inference engine optimized for diverse AI accelerators

Created 2 months ago
546 stars

Top 58.6% on SourcePulse

Project Summary

xLLM: High-Performance LLM Inference Engine for Diverse AI Accelerators

xLLM is an efficient inference framework for Large Language Models (LLMs), optimized in particular for Chinese AI accelerators such as Ascend. It targets enterprises seeking to deploy LLMs at lower cost, using a service-engine decoupled architecture whose performance comes from the concrete optimizations described below: multi-stream parallelism, graph fusion, speculative inference, and global KV cache management.

How It Works

The framework employs a service-engine decoupled architecture. The service layer handles elastic scheduling and dynamic request dispatch; the engine layer implements multi-stream parallel computing, graph fusion optimization, speculative inference, dynamic load balancing, and global KV cache management. Together these accelerate inference by overlapping computation with communication, optimizing memory usage, and adapting at runtime to model shapes and workloads, particularly on the Ascend accelerators the project targets.
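
To make the compute/communication overlap concrete, here is a minimal generic sketch using stock PyTorch CUDA streams. It is not xLLM code: xLLM targets Ascend NPUs (where torch_npu exposes analogous stream APIs), and comm_stream, decode_step, and the tensor names are hypothetical stand-ins.

    import torch

    # Side stream dedicated to memory/communication traffic (e.g. staging a
    # KV-cache block) so it can proceed concurrently with layer compute.
    comm_stream = torch.cuda.Stream()

    def decode_step(layer, hidden, kv_block_cpu):
        # Launch the KV-block transfer on the side stream...
        with torch.cuda.stream(comm_stream):
            kv_block = kv_block_cpu.to("cuda", non_blocking=True)
        # ...while the default stream keeps computing the current layer.
        hidden = layer(hidden)
        # Block only at the point the transferred KV block is actually needed.
        torch.cuda.current_stream().wait_stream(comm_stream)
        return hidden, kv_block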

Quick Start & Requirements

Installation is primarily via Docker: pull a pre-built image (e.g., xllm/xllm-ai:xllm-0.6.0-dev-hb-rc2-py3.11-oe24.03-lts) and run the container with the necessary device passthrough (--device=/dev/davinci0, etc.) and volume mounts. Alternatively, build from source by cloning the repository, initializing submodules, installing dependencies via pip, and compiling with setup.py; vcpkg is required for the build. Ascend AI accelerators are the primary supported hardware. Official documentation is available at https://xllm.readthedocs.io/zh-cn/latest/ and Docker images at https://hub.docker.com/r/xllm/xllm-ai.
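
Assembled from the fragments above, a plausible Docker quick start might look like the following. The image tag and the /dev/davinci0 device flag come from the documentation; the extra device node and driver mount are common Ascend-container conventions and may differ on your system, so treat this as a sketch and consult the official docs for the authoritative command.

    docker pull xllm/xllm-ai:xllm-0.6.0-dev-hb-rc2-py3.11-oe24.03-lts
    docker run -it \
      --device=/dev/davinci0 \
      --device=/dev/davinci_manager \
      -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
      xllm/xllm-ai:xllm-0.6.0-dev-hb-rc2-py3.11-oe24.03-lts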

Highlighted Details

  • Optimized Inference: Features graph pipeline execution, dynamic shape optimization, efficient memory management, and global KV cache management.
  • Algorithm Acceleration: Leverages speculative decoding (see the sketch after this list) and dynamic MoE expert load balancing for improved efficiency.
  • Broad Model Support: Compatible with models like DeepSeek-V3/R1, Qwen2/3, Kimi-k2, and Llama2/3.
  • Production Proven: Deployed in JD.com's core retail business across various applications.
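
The sketch below shows the draft-then-verify loop behind speculative decoding, the technique the Algorithm Acceleration bullet names, in generic PyTorch. draft_model, target_model, and the greedy acceptance rule are illustrative assumptions, not xLLM's implementation; production systems typically use rejection sampling over the two models' output distributions instead of exact-match acceptance.

    import torch

    def speculative_step(draft_model, target_model, tokens, k=4):
        # Models are assumed to map (batch, seq) token ids -> (batch, seq, vocab) logits.
        # 1. Cheaply propose k candidate tokens with the small draft model.
        draft = tokens
        for _ in range(k):
            logits = draft_model(draft)[:, -1]
            draft = torch.cat([draft, logits.argmax(-1, keepdim=True)], dim=-1)
        proposed = draft[:, tokens.shape[1]:]                  # (batch, k)
        # 2. Score all k proposals with the large model in ONE forward pass.
        logits = target_model(draft)[:, tokens.shape[1] - 1 : -1]
        verified = logits.argmax(-1)                           # (batch, k)
        # 3. Keep the longest prefix on which the target model agrees, so the
        #    large model validates several tokens per invocation.
        agree = (verified == proposed).long().cumprod(dim=-1)
        n_accept = int(agree.sum(dim=-1).min())
        return torch.cat([tokens, proposed[:, :n_accept]], dim=-1)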

Maintenance & Community

The project actively encourages contributions through issue reporting and pull requests. Community support is available via internal Slack channels and a WeChat user group. Several university research labs and numerous developers are acknowledged contributors.

Licensing & Compatibility

xLLM is licensed under the Apache License 2.0, which permits commercial use and modification.

Limitations & Caveats

The framework is heavily optimized for specific Chinese AI accelerators (e.g., Ascend), potentially limiting performance or compatibility on other hardware. The provided Docker image tags suggest the project may be in a development or release candidate stage. Setup requires specific hardware configurations and potentially complex Docker environment management.

Health Check

  • Last Commit: 8 hours ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 81
  • Issues (30d): 23

Star History

321 stars in the last 30 days

Explore Similar Projects

ArcticInference by snowflakedb
  vLLM plugin for high-throughput, low-latency LLM and embedding inference
  Top 2.9% on SourcePulse · 278 stars · Created 6 months ago · Updated 8 hours ago
  Starred by Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), Luis Capelo (Cofounder of Lightning AI), and 1 more.

LitServe by Lightning-AI
  AI inference pipeline framework
  Top 0.1% on SourcePulse · 4k stars · Created 1 year ago · Updated 22 hours ago
  Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Luis Capelo (Cofounder of Lightning AI), and 3 more.

fastllm by ztxz16
  High-performance C++ LLM inference library
  Top 0.6% on SourcePulse · 4k stars · Created 2 years ago · Updated 2 weeks ago
  Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Ying Sheng (Coauthor of SGLang).