KsanaLLM by Tencent

LLM inference and serving engine

Created 1 year ago · 499 stars · Top 62.2% on SourcePulse

Project Summary

KsanaLLM is a high-performance, user-friendly engine for LLM inference and serving, aimed at researchers and developers who need to deploy large language models efficiently. It achieves high throughput and low latency through optimized CUDA kernels, PagedAttention, and dynamic batching, and it supports a wide range of Hugging Face models on hardware including NVIDIA GPUs and Huawei Ascend NPUs.

How It Works

KsanaLLM uses optimized CUDA kernels, drawing on vLLM and TensorRT-LLM, to maximize inference speed. It manages attention key-value caches efficiently with PagedAttention and ships experimental dynamic batching with fine-grained task-scheduling and memory-utilization optimizations. The engine also supports multiple decoding algorithms and multi-GPU tensor parallelism for scalable deployment.
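To make PagedAttention concrete, below is a minimal Python sketch of the paged KV-cache bookkeeping it relies on: each sequence maps logical block indices to physical blocks drawn from a shared pool, so cache memory grows on demand instead of being reserved up front for the maximum sequence length. All names here are illustrative assumptions; KsanaLLM's actual allocator is implemented in C++/CUDA.

    # Minimal sketch of PagedAttention-style KV-cache bookkeeping.
    # Illustrative only; KsanaLLM's real allocator lives in C++/CUDA.

    class BlockAllocator:
        """Hands out fixed-size KV-cache blocks from a shared pool."""

        def __init__(self, num_blocks: int, block_size: int):
            self.block_size = block_size               # tokens per block
            self.free_blocks = list(range(num_blocks))

        def allocate(self) -> int:
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; request must wait")
            return self.free_blocks.pop()

        def free(self, blocks: list[int]) -> None:
            self.free_blocks.extend(blocks)

    class Sequence:
        """Tracks which physical blocks hold one request's KV cache."""

        def __init__(self, allocator: BlockAllocator):
            self.allocator = allocator
            self.block_table: list[int] = []  # logical -> physical block ids
            self.num_tokens = 0

        def append_token(self) -> None:
            # A new block is allocated only when the last one is full, so
            # waste is bounded by one partially filled block per sequence.
            if self.num_tokens % self.allocator.block_size == 0:
                self.block_table.append(self.allocator.allocate())
            self.num_tokens += 1

    allocator = BlockAllocator(num_blocks=1024, block_size=16)
    seq = Sequence(allocator)
    for _ in range(40):              # 40 tokens -> ceil(40 / 16) = 3 blocks
        seq.append_token()
    print(seq.block_table)           # three physical block ids
    allocator.free(seq.block_table)  # blocks return to the pool when done

Dynamic batching then becomes a scheduling question: admit new requests while enough free blocks remain, and queue or preempt them when the pool runs dry.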

Quick Start & Requirements

  • NVIDIA GPU:
    • Install nvidia-docker and start a container from the nvcr.io/nvidia/pytorch:24.03-py3 image.
    • Inside the container: pip install -r requirements.txt; apt update && apt install git-lfs -y.
    • Compile with cmake -DSM=<SM_VERSION> -DWITH_TESTING=ON .. && make -j32 (see the snippet after this list for one way to determine <SM_VERSION>).
  • Huawei Ascend NPU:
    • Install Ascend drivers and CANN (v8.0RC2 recommended).
    • Build Docker image: docker build -f Dockerfile.npu -t ksana-npu ..
    • Run container with specific device mappings and volume mounts.
    • Install torch_npu and the packages in requirements.txt.
    • Compile with cmake -DWITH_TESTING=ON -DWITH_CUDA=OFF -DWITH_ACL=ON .. && make -j32.
  • General: Clone the repository with git clone --recurse-submodules.
  • Documentation: Optional Weight Map Guide, Optional KV Scale Guide.
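The <SM_VERSION> value passed to cmake is the GPU's compute capability with the dot removed (e.g., 80 for an A100, 86 for an A10). One way to look it up (a convenience, not a step from the README) is to query PyTorch inside the container:

    # Query the GPU's compute capability to pick <SM_VERSION> for cmake.
    # Not an official KsanaLLM step, just a convenient lookup.
    import torch

    major, minor = torch.cuda.get_device_capability(0)  # e.g., (8, 0) on A100
    print(f"cmake -DSM={major}{minor} ...")             # e.g., -DSM=80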

Highlighted Details

  • Supports Llama, Baichuan, Qwen, and Yi models, including Llama3 8B/70B and Qwen1.5 72B/110B.
  • Offers an OpenAI-compatible API server for seamless integration (see the client sketch after this list).
  • Provides streaming output capabilities for interactive applications.
  • Tested on NVIDIA A10, A100, L20, and L40 GPUs, and on the Huawei Ascend 910B2C.
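Since the server speaks the OpenAI API, the standard openai Python client should work against it, including streaming. In the sketch below, the base URL, port, API key, and model name are illustrative assumptions, not values taken from the README.

    # Hypothetical client for the OpenAI-compatible endpoint; the base URL,
    # port, and model name are assumptions, not from the README.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

    stream = client.chat.completions.create(
        model="llama3-8b",   # whichever model the server has loaded
        messages=[{"role": "user", "content": "Hello!"}],
        stream=True,         # exercise the streaming output path
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)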

Maintenance & Community

  • The README does not mention contributors, sponsorships, or community channels (e.g., Discord or Slack).

Licensing & Compatibility

  • The README does not explicitly state a license.

Limitations & Caveats

  • Dynamic batching is marked as experimental.
  • Ascend NPU support is limited to Ascend NPU + X86 CPU configurations.
  • The target CUDA SM version must be set explicitly (via -DSM=<SM_VERSION>) when compiling for NVIDIA GPUs.
Health Check

  • Last Commit: 6 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 19 stars in the last 30 days

Explore Similar Projects

Starred by Ying Sheng (Coauthor of SGLang) and Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake).

llm-analysis by cli99

  • 455 stars (0.4%)
  • CLI tool for LLM latency/memory analysis during training/inference
  • Created 2 years ago; updated 5 months ago

Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), Nikola Borisov (Founder and CEO of DeepInfra), and 3 more.

tensorrtllm_backend by triton-inference-server

  • 889 stars (0.2%)
  • Triton backend for serving TensorRT-LLM models
  • Created 2 years ago; updated 1 day ago

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla and OpenAI; author of CS 231n), Clement Delangue (Cofounder of Hugging Face), and 58 more.

vllm by vllm-project

  • 58k stars (1.1%)
  • LLM serving engine for high-throughput, memory-efficient inference
  • Created 2 years ago; updated 13 hours ago