LLM inference and serving engine
KsanaLLM is a high-performance, user-friendly engine for LLM inference and serving, aimed at researchers and developers who need to deploy large language models efficiently. It delivers high throughput and low latency through optimized CUDA kernels, PagedAttention, and dynamic batching, and supports a wide range of Hugging Face models on hardware including NVIDIA GPUs and Huawei Ascend NPUs.
How It Works
KsanaLLM employs optimized CUDA kernels, drawing from vLLM and TensorRT-LLM, to maximize inference speed. It manages key-value cache memory efficiently via PagedAttention and includes experimental support for dynamic batching, backed by task-scheduling and memory-utilization optimizations. The engine supports multiple decoding algorithms and multi-GPU tensor parallelism for scalable deployment.
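To illustrate the PagedAttention idea mentioned above, the sketch below shows a minimal, hypothetical block table in Python: each sequence's key-value cache is mapped onto fixed-size physical blocks drawn from a shared pool on demand, so memory grows in small steps instead of being preallocated for the maximum sequence length. The class names, block size, and pool size are assumptions for illustration only and do not reflect KsanaLLM's internal implementation.

    # Conceptual sketch (not KsanaLLM's actual code): a PagedAttention-style
    # block table mapping a sequence's logical KV-cache positions onto
    # fixed-size physical blocks allocated from a shared pool on demand.

    BLOCK_SIZE = 16  # tokens per physical KV-cache block (illustrative value)

    class BlockAllocator:
        def __init__(self, num_blocks):
            self.free_blocks = list(range(num_blocks))

        def allocate(self):
            if not self.free_blocks:
                raise RuntimeError("KV-cache pool exhausted; scheduler must preempt")
            return self.free_blocks.pop()

        def free(self, block_id):
            self.free_blocks.append(block_id)

    class SequenceKVCache:
        """Tracks which physical blocks hold one sequence's keys/values."""
        def __init__(self, allocator):
            self.allocator = allocator
            self.block_table = []   # logical block index -> physical block id
            self.num_tokens = 0

        def append_token(self):
            # Grab a new physical block only when the current one is full,
            # so memory grows in BLOCK_SIZE steps instead of being preallocated.
            if self.num_tokens % BLOCK_SIZE == 0:
                self.block_table.append(self.allocator.allocate())
            self.num_tokens += 1

        def release(self):
            for block_id in self.block_table:
                self.allocator.free(block_id)
            self.block_table.clear()
            self.num_tokens = 0

    allocator = BlockAllocator(num_blocks=1024)
    seq = SequenceKVCache(allocator)
    for _ in range(40):          # decode 40 tokens
        seq.append_token()
    print(seq.block_table)       # 3 physical blocks cover 40 tokens (ceil(40/16))
    seq.release()

A scheduler built on such a pool can admit or preempt requests based on the free-block count, which is the general mechanism behind dynamic batching under memory pressure.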
Quick Start & Requirements
- NVIDIA GPU: requires nvidia-docker; start from the nvcr.io/nvidia/pytorch:24.03-py3 container, install dependencies with pip install -r requirements.txt and apt update && apt install git-lfs -y, then build with cmake -DSM=<SM_VERSION> -DWITH_TESTING=ON .. && make -j32.
- Huawei Ascend NPU: build the image with docker build -f Dockerfile.npu -t ksana-npu ., install torch_npu and the packages in requirements.txt, then build with cmake -DWITH_TESTING=ON -DWITH_CUDA=OFF -DWITH_ACL=ON .. && make -j32.
- Clone the repository with git clone --recurse-submodules so that all submodules are fetched.
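Once a server is running, it can be queried over HTTP. The minimal client sketch below is only illustrative: the port, route (/generate), and payload fields are assumptions, not KsanaLLM's documented API; consult the repository's serving and client scripts for the actual interface.

    # Hypothetical client sketch: endpoint path, port, and payload fields are
    # assumptions for illustration, not KsanaLLM's documented API.
    import json
    import urllib.request

    payload = {
        "prompt": "Explain PagedAttention in one sentence.",
        "max_new_tokens": 64,
    }
    req = urllib.request.Request(
        "http://localhost:8080/generate",            # assumed address and route
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(resp.read().decode("utf-8"))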
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats