KsanaLLM by Tencent

LLM inference and serving engine

created 1 year ago
473 stars

Top 65.3% on sourcepulse

Project Summary

KsanaLLM is a high-performance, user-friendly engine for LLM inference and serving, aimed at researchers and developers who need to deploy large language models efficiently. It achieves high throughput and low latency through optimized CUDA kernels, PagedAttention, and dynamic batching, and it supports a wide range of Hugging Face models on hardware including NVIDIA GPUs and Huawei Ascend NPUs.

How It Works

KsanaLLM maximizes inference speed with optimized CUDA kernels that draw on vLLM and TensorRT-LLM. It manages memory efficiently via PagedAttention for key-value caches and offers experimental dynamic batching with fine-grained task scheduling and memory-utilization optimizations. The engine supports multiple decoding algorithms and multi-GPU tensor parallelism for scalable deployment.

Quick Start & Requirements

  • NVIDIA GPU:
    • Install nvidia-docker and run the nvcr.io/nvidia/pytorch:24.03-py3 container.
    • Inside the container: pip install -r requirements.txt and apt update && apt install git-lfs -y.
    • Compile with cmake -DSM=<SM_VERSION> -DWITH_TESTING=ON .. && make -j32 (see the consolidated sketch after this list).
  • Huawei Ascend NPU:
    • Install the Ascend drivers and CANN (v8.0RC2 recommended).
    • Build the Docker image: docker build -f Dockerfile.npu -t ksana-npu .
    • Run the container with the required device mappings and volume mounts.
    • Install torch_npu and the packages in requirements.txt.
    • Compile with cmake -DWITH_TESTING=ON -DWITH_CUDA=OFF -DWITH_ACL=ON .. && make -j32.
  • General: Clone the repository with git clone --recurse-submodules.
  • Documentation: optional Weight Map Guide and KV Scale Guide.
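
Putting those steps together, a build session might look like the sketch below. The repository URL, mount paths, and build directory are assumptions for illustration; the Ascend device flags are typical Ascend Docker options rather than values quoted from the README; and <SM_VERSION> must match your GPU.

    # --- NVIDIA GPU path (sketch; URL and paths are assumed) ---
    git clone --recurse-submodules https://github.com/Tencent/KsanaLLM.git  # assumed URL
    docker run --gpus all -it --rm -v "$PWD/KsanaLLM:/workspace/KsanaLLM" \
        nvcr.io/nvidia/pytorch:24.03-py3
    # Inside the container:
    cd /workspace/KsanaLLM
    pip install -r requirements.txt
    apt update && apt install git-lfs -y
    mkdir -p build && cd build
    cmake -DSM=<SM_VERSION> -DWITH_TESTING=ON .. && make -j32

    # --- Huawei Ascend NPU path (sketch; device flags are typical, not quoted) ---
    docker build -f Dockerfile.npu -t ksana-npu .
    docker run -it --rm \
        --device=/dev/davinci0 --device=/dev/davinci_manager \
        --device=/dev/devmm_svm --device=/dev/hisi_hdc \
        -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
        -v "$PWD/KsanaLLM:/workspace/KsanaLLM" ksana-npu
    # Inside the container: install torch_npu, then
    pip install -r requirements.txt
    mkdir -p build && cd build
    cmake -DWITH_TESTING=ON -DWITH_CUDA=OFF -DWITH_ACL=ON .. && make -j32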

Highlighted Details

  • Supports Llama, Baichuan, Qwen, and Yi models, including Llama3 8B/70B and Qwen1.5 72B/110B.
  • Offers an OpenAI-compatible API server for seamless integration (see the request sketch after this list).
  • Provides streaming output capabilities for interactive applications.
  • Tested on NVIDIA A10, A100, L20, L40 and Huawei Ascend 910B2C.
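
Because the server exposes an OpenAI-compatible API, a standard chat-completions request should work against it. The host, port, and model name below are placeholders, not values from the README; only the endpoint shape follows the OpenAI convention.

    # Placeholder host/port/model; streaming enabled to exercise the streaming output.
    curl http://localhost:8080/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
              "model": "llama3-8b",
              "messages": [{"role": "user", "content": "Hello!"}],
              "stream": true
            }'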

Maintenance & Community

  • No specific contributors, sponsorships, or community links (e.g., Discord/Slack) are mentioned in the README.

Licensing & Compatibility

  • The README does not explicitly state a license.

Limitations & Caveats

  • Dynamic batching is marked as experimental.
  • Ascend NPU support is limited to Ascend NPU + X86 CPU configurations.
  • Specific CUDA SM versions must be set at compile time for NVIDIA GPUs (see the examples below).
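
For reference, -DSM conventionally takes the GPU's compute capability with the dot removed; the values below are standard NVIDIA compute capabilities for the tested cards, not figures from the README.

    cmake -DSM=80 -DWITH_TESTING=ON .. && make -j32   # A100 (compute capability 8.0)
    cmake -DSM=86 -DWITH_TESTING=ON .. && make -j32   # A10  (compute capability 8.6)
    cmake -DSM=89 -DWITH_TESTING=ON .. && make -j32   # L20/L40 (compute capability 8.9)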

Health Check

  • Last commit: 1 week ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 3
  • Issues (30d): 0

Star History

  • 154 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering, Designing Machine Learning Systems), Jaret Burkett (founder of Ostris), and 1 more.

nunchaku by nunchaku-tech

  • Top 2.1% · 3k stars
  • High-performance 4-bit diffusion model inference engine
  • created 8 months ago, updated 17 hours ago
  • Starred by Chip Huyen (author of AI Engineering, Designing Machine Learning Systems) and Ying Sheng (author of SGLang).

fastllm by ztxz16

  • Top 0.4% · 4k stars
  • High-performance C++ LLM inference library
  • created 2 years ago, updated 2 weeks ago
  • Starred by Tobi Lutke (cofounder of Shopify), Chip Huyen (author of AI Engineering, Designing Machine Learning Systems), and 7 more.

ColossalAI by hpcaitech

  • Top 0.1% · 41k stars
  • AI system for large-scale parallel training
  • created 3 years ago, updated 3 days ago