LLM inference engine for Llama and variants, optimized for PCIe GPUs
ZhiLight is an LLM inference acceleration engine designed to significantly boost inference performance for Llama and its variants, particularly on PCIe-based GPUs. It targets researchers and developers seeking higher throughput and lower latency compared to existing solutions like vLLM, offering an OpenAI-compatible API for ease of integration.
How It Works
ZhiLight employs a custom-defined tensor and unified global memory management system, enabling optimizations like encode and all-reduce overlap ("dual streams") and INT8-quantized all-reduce to minimize communication overhead. It features optimized fused kernels for operations such as QKV, residual connections, and layernorm, along with fused batch attention for decoding leveraging tensor core instructions. The engine supports Tensor Parallelism (TP) and Pipeline Parallelism (PP) on a single node, with TP being recommended.
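The INT8-quantized all-reduce trades a small amount of precision for roughly half the PCIe traffic of an FP16 exchange. ZhiLight's actual kernels are fused C++/CUDA code; the following is only a conceptual sketch of the idea using torch.distributed, where the function name, the per-tensor scaling, and the gather-then-dequantize strategy are illustrative assumptions, not the engine's implementation.

```python
# Conceptual sketch of an INT8-quantized all-reduce (not ZhiLight's kernel):
# each rank quantizes its FP16 tensor to INT8 before communicating, cutting
# traffic over PCIe roughly in half, then dequantizes and sums afterwards.
import torch
import torch.distributed as dist

def int8_allreduce(x: torch.Tensor) -> torch.Tensor:
    world = dist.get_world_size()

    # Per-tensor symmetric quantization: scale so the absolute max maps to 127.
    scale = x.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp((x / scale).round(), -127, 127).to(torch.int8)

    # Exchange the INT8 payloads and their scales; summation happens after
    # dequantization to avoid INT8 overflow during the reduction itself.
    q_all = [torch.empty_like(q) for _ in range(world)]
    s_all = [torch.empty_like(scale) for _ in range(world)]
    dist.all_gather(q_all, q)
    dist.all_gather(s_all, scale)

    # Dequantize each rank's contribution and accumulate in FP32 for accuracy.
    out = torch.zeros_like(x, dtype=torch.float32)
    for qi, si in zip(q_all, s_all):
        out += qi.to(torch.float32) * si
    return out.to(x.dtype)
```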
Quick Start & Requirements
Install from a clone of the repository with pip install -e ., or pull the prebuilt image with docker pull ghcr.io/zhihu/zhilight/zhilight:0.4.8-cu124. Launch the OpenAI-compatible server with python -m zhilight.server.openai.entrypoints.api_server [options].
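Once the server is running, any OpenAI-compatible client should be able to talk to it. A minimal sketch using the official openai Python package follows; the base URL, port, and model name are assumptions for illustration and should match whatever [options] the server was launched with.

```python
# Minimal client sketch against ZhiLight's OpenAI-compatible endpoint.
# The port (8000) and model identifier are illustrative assumptions; adjust
# them to match the [options] passed to the api_server entrypoint.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local server address
    api_key="EMPTY",                      # local servers typically ignore the key
)

resp = client.chat.completions.create(
    model="Llama-3-8B-Instruct",          # assumed model name
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)
```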
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The provided benchmarks show mixed results, with ZhiLight sometimes exhibiting lower QPS or higher latency compared to vLLM or SGLang on specific configurations (e.g., Qwen2-72B-Instruct-GPTQ-Int4 on AD102 PCIe). The README does not detail specific hardware requirements beyond CUDA, nor does it provide explicit guidance on setup time or resource footprint.