vLLM-Kunlun by baidu

vLLM inference acceleration for Kunlun XPU

Created 2 months ago
258 stars

Top 98.1% on SourcePulse

Project Summary

vLLM Kunlun is a community-maintained hardware plugin designed to enable the vLLM inference engine to run on Kunlun XPU hardware. It targets users and researchers who need to leverage Kunlun's specialized compute capabilities for deploying and experimenting with large language models, offering a decoupled integration strategy for broad model compatibility.

How It Works

This project implements a hardware-pluggable interface, following vLLM's architectural principles for hardware integration. That design decouples the Kunlun XPU backend from the core vLLM framework, which simplifies integration and enables efficient execution of a wide range of LLM architectures, including Transformer-based models, Mixture-of-Experts (MoE) models, embedding models, and multimodal LLMs, directly on Kunlun hardware.
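vLLM discovers out-of-tree hardware backends through its plugin system. As a minimal sketch of that pattern (the package and class names below are illustrative assumptions, not the actual vllm-kunlun code), a plugin package exposes a registration function through the `vllm.platform_plugins` entry-point group, and vLLM calls it at startup:

```python
# Sketch of vLLM's out-of-tree platform-plugin pattern.
# "vllm_kunlun" and "KunlunPlatform" are hypothetical names used for
# illustration; the real plugin's module layout may differ.

def register() -> "str | None":
    """Return the dotted path of a Platform subclass if the backend
    is usable on this machine, or None to let vLLM try other platforms."""
    try:
        import vllm_kunlun  # hypothetical plugin package  # noqa: F401
        return "vllm_kunlun.platform.KunlunPlatform"  # hypothetical class
    except ImportError:
        return None  # plugin not installed; fall back to other backends
```

Because vLLM only ever sees the registered platform class, the backend can evolve independently of the core engine, which is what makes the integration "decoupled".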

Quick Start & Requirements

  • Prerequisites: Kunlun3 P800 hardware, Ubuntu 22.04, Python 3.10 or higher, PyTorch 2.5.1 or higher, and a vLLM installation whose version matches the installed vllm-kunlun release.
  • Installation: Detailed setup instructions are available in the official QuickStart and Installation documentation.
  • Recommended Versions: The latest stable release is v0.11.0.
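Before installing, it can help to confirm the stated prerequisites. A minimal sketch (the helper below is illustrative and not part of the plugin; only the Python 3.10 and PyTorch 2.5.1 version floors come from the requirements above):

```python
import sys

def _to_tuple(version: str) -> tuple:
    """Parse a dotted version string into a comparable tuple, e.g. '3.10' -> (3, 10)."""
    return tuple(int(part) for part in version.split("."))

def meets_minimum(version: str, floor: str) -> bool:
    """True if `version` is at least `floor`, compared numerically."""
    return _to_tuple(version) >= _to_tuple(floor)

def check_prerequisites() -> dict:
    """Check the software floors listed in the Quick Start requirements."""
    results = {
        "python>=3.10": meets_minimum(
            f"{sys.version_info.major}.{sys.version_info.minor}", "3.10"
        ),
    }
    try:
        import torch  # PyTorch >= 2.5.1 is required
        # Strip local version suffixes such as "+cu121" before comparing.
        results["torch>=2.5.1"] = meets_minimum(
            torch.__version__.split("+")[0], "2.5.1"
        )
    except ImportError:
        results["torch>=2.5.1"] = False
    return results
```

Hardware (Kunlun3 P800) and OS (Ubuntu 22.04) requirements still need to be checked separately; see the official Installation documentation for the authoritative setup steps.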

Highlighted Details

  • Supports a comprehensive range of models, including Qwen (2, 2.5, 3, 3-Moe, 3-Next), Llama (2, 3, 3.1), DeepSeek (R1, V3, V3.2), Kimi-K2, MiMo-V2-Flash, and gpt-oss.
  • Incorporates support for advanced LLM techniques such as Quantization, LoRA fine-tuning, Piecewise operations, and Kunlun Graph optimizations.
  • Demonstrates high-performance computing on Kunlun3 P800, with reported benchmarks at 16 concurrent requests and 2048-token input/output lengths.

Maintenance & Community

  • The project was initiated in December 2025.
  • Significant support is provided by the KunLunXin team, who supply XPU resources for development and testing.
  • Community discussions and updates can be found via their Slack channel.

Licensing & Compatibility

  • The project is licensed under the Apache License 2.0.
  • This permissive license generally allows commercial use and integration into closed-source projects, subject to the license's attribution and notice requirements.

Limitations & Caveats

  • The plugin has specific hardware (Kunlun3 P800) and operating system (Ubuntu 22.04) requirements, which may limit its applicability on other platforms.
  • As a community-maintained project, its support structure and development velocity may differ from officially maintained vLLM components.
Health Check

  • Last Commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 48
  • Issues (30d): 17
  • Star History: 20 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI).

rtp-llm by alibaba

0.3% · 1k stars
LLM inference engine for diverse applications
Created 2 years ago · Updated 17 hours ago
Starred by Yaowei Zheng (Author of LLaMA-Factory), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 7 more.

llm-awq by mit-han-lab

0.1% · 3k stars
Weight quantization research paper for LLM compression/acceleration
Created 2 years ago · Updated 7 months ago
Starred by Wing Lian (Founder of Axolotl AI) and Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems").

airllm by lyogavin

9.5% · 13k stars
Inference optimization for LLMs on low-resource hardware
Created 2 years ago · Updated 5 months ago