KsanaLLM by Tencent

LLM inference and serving engine

Created 1 year ago · 499 stars · Top 62.2% on SourcePulse

Project Summary

KsanaLLM is a high-performance, user-friendly engine for LLM inference and serving, aimed at researchers and developers who need to deploy large language models efficiently. It achieves high throughput and low latency through optimized CUDA kernels, PagedAttention, and dynamic batching, and it supports a wide range of Hugging Face models on hardware including NVIDIA GPUs and Huawei Ascend NPUs.

How It Works

KsanaLLM uses optimized CUDA kernels, drawing on vLLM and TensorRT-LLM, to maximize inference speed. It manages attention key-value caches efficiently with PagedAttention and ships experimental dynamic batching with fine-grained task-scheduling and memory-utilization optimizations. The engine also supports multiple decoding algorithms and multi-GPU tensor parallelism for scalable deployment.
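To make PagedAttention concrete, below is a minimal Python sketch of the paged KV-cache bookkeeping it relies on: each sequence maps logical block indices to physical blocks drawn from a shared pool, so cache memory grows on demand instead of being reserved up front for the maximum sequence length. All names here are illustrative assumptions; KsanaLLM's actual allocator is implemented in C++/CUDA.

    # Minimal sketch of PagedAttention-style KV-cache bookkeeping.
    # Illustrative only; KsanaLLM's real allocator lives in C++/CUDA.

    class BlockAllocator:
        """Hands out fixed-size KV-cache blocks from a shared pool."""

        def __init__(self, num_blocks: int, block_size: int):
            self.block_size = block_size               # tokens per block
            self.free_blocks = list(range(num_blocks))

        def allocate(self) -> int:
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; request must wait")
            return self.free_blocks.pop()

        def free(self, blocks: list[int]) -> None:
            self.free_blocks.extend(blocks)

    class Sequence:
        """Tracks which physical blocks hold one request's KV cache."""

        def __init__(self, allocator: BlockAllocator):
            self.allocator = allocator
            self.block_table: list[int] = []  # logical -> physical block ids
            self.num_tokens = 0

        def append_token(self) -> None:
            # A new block is allocated only when the last one is full, so
            # waste is bounded by one partially filled block per sequence.
            if self.num_tokens % self.allocator.block_size == 0:
                self.block_table.append(self.allocator.allocate())
            self.num_tokens += 1

    allocator = BlockAllocator(num_blocks=1024, block_size=16)
    seq = Sequence(allocator)
    for _ in range(40):              # 40 tokens -> ceil(40 / 16) = 3 blocks
        seq.append_token()
    print(seq.block_table)           # three physical block ids
    allocator.free(seq.block_table)  # blocks return to the pool when done

Dynamic batching then becomes a scheduling question: admit new requests while enough free blocks remain, and queue or preempt them when the pool runs dry.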

Quick Start & Requirements

  • NVIDIA GPU:
    • Install nvidia-docker and start a container from the nvcr.io/nvidia/pytorch:24.03-py3 image.
    • Inside the container: pip install -r requirements.txt; apt update && apt install git-lfs -y.
    • Compile with cmake -DSM=<SM_VERSION> -DWITH_TESTING=ON .. && make -j32 (see the snippet after this list for one way to determine <SM_VERSION>).
  • Huawei Ascend NPU:
    • Install Ascend drivers and CANN (v8.0RC2 recommended).
    • Build Docker image: docker build -f Dockerfile.npu -t ksana-npu ..
    • Run container with specific device mappings and volume mounts.
    • Install torch_npu and the packages in requirements.txt.
    • Compile with cmake -DWITH_TESTING=ON -DWITH_CUDA=OFF -DWITH_ACL=ON .. && make -j32.
  • General: Clone the repository with git clone --recurse-submodules.
  • Documentation: Optional Weight Map Guide, Optional KV Scale Guide.
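The <SM_VERSION> value passed to cmake is the GPU's compute capability with the dot removed (e.g., 80 for an A100, 86 for an A10). One way to look it up (a convenience, not a step from the README) is to query PyTorch inside the container:

    # Query the GPU's compute capability to pick <SM_VERSION> for cmake.
    # Not an official KsanaLLM step, just a convenient lookup.
    import torch

    major, minor = torch.cuda.get_device_capability(0)  # e.g., (8, 0) on A100
    print(f"cmake -DSM={major}{minor} ...")             # e.g., -DSM=80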

Highlighted Details

  • Supports Llama, Baichuan, Qwen, and Yi models, including Llama3 8B/70B and Qwen1.5 72B/110B.
  • Offers an OpenAI-compatible API server for seamless integration (see the client sketch after this list).
  • Provides streaming output capabilities for interactive applications.
  • Tested on NVIDIA A10, A100, L20, and L40 GPUs, and on the Huawei Ascend 910B2C.
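Since the server speaks the OpenAI API, the standard openai Python client should work against it, including streaming. In the sketch below, the base URL, port, API key, and model name are illustrative assumptions, not values taken from the README.

    # Hypothetical client for the OpenAI-compatible endpoint; the base URL,
    # port, and model name are assumptions, not from the README.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

    stream = client.chat.completions.create(
        model="llama3-8b",   # whichever model the server has loaded
        messages=[{"role": "user", "content": "Hello!"}],
        stream=True,         # exercise the streaming output path
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)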

Maintenance & Community

  • The README does not mention contributors, sponsorships, or community channels (e.g., Discord or Slack).

Licensing & Compatibility

  • The README does not explicitly state a license.

Limitations & Caveats

  • Dynamic batching is marked as experimental.
  • Ascend NPU support is limited to Ascend NPU + X86 CPU configurations.
  • The target CUDA SM version must be set explicitly (via -DSM=<SM_VERSION>) when compiling for NVIDIA GPUs.
Health Check

  • Last Commit: 6 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 19 stars in the last 30 days

Explore Similar Projects

Starred by Ying Sheng (Coauthor of SGLang) and Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake).

llm-analysis by cli99

  • 455 stars (0.4%)
  • CLI tool for LLM latency/memory analysis during training/inference
  • Created 2 years ago; updated 5 months ago

Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), Nikola Borisov (Founder and CEO of DeepInfra), and 3 more.

tensorrtllm_backend by triton-inference-server

  • 889 stars (0.2%)
  • Triton backend for serving TensorRT-LLM models
  • Created 2 years ago; updated 1 day ago

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla and OpenAI; author of CS 231n), Clement Delangue (Cofounder of Hugging Face), and 58 more.

vllm by vllm-project

  • 58k stars (1.1%)
  • LLM serving engine for high-throughput, memory-efficient inference
  • Created 2 years ago; updated 13 hours ago