ZhiLight by zhihu

LLM inference engine for Llama and variants, optimized for PCIe GPUs

created 8 months ago · 902 stars

Top 41.1% on sourcepulse

View on GitHub
Project Summary

ZhiLight is an LLM inference acceleration engine designed to significantly boost inference performance for Llama and its variants, particularly on PCIe-based GPUs. It targets researchers and developers seeking higher throughput and lower latency compared to existing solutions like vLLM, offering an OpenAI-compatible API for ease of integration.

How It Works

ZhiLight employs a custom-defined tensor and unified global memory management system, enabling optimizations like encode and all-reduce overlap ("dual streams") and INT8-quantized all-reduce to minimize communication overhead. It features optimized fused kernels for operations such as QKV, residual connections, and layernorm, along with fused batch attention for decoding leveraging tensor core instructions. The engine supports Tensor Parallelism (TP) and Pipeline Parallelism (PP) on a single node, with TP being recommended.
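
The "dual streams" idea can be pictured with a short sketch: launch the cross-GPU all-reduce asynchronously and keep computing on the default stream until the reduced tensor is actually needed. This is an illustrative PyTorch sketch of the general technique, not ZhiLight's implementation (ZhiLight additionally compresses the payload with INT8 quantization):

    # Illustrative compute/communication overlap ("dual streams"),
    # not ZhiLight's code. Run under torchrun with the NCCL backend.
    import torch
    import torch.distributed as dist

    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    x = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
    partial = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)

    # NCCL runs the collective on its own stream, so with async_op=True
    # the matmul below overlaps with the all-reduce over PCIe.
    work = dist.all_reduce(partial, async_op=True)
    y = x @ x    # "encode" work that does not depend on the reduction
    work.wait()  # block only when the reduced tensor is needed
    z = y + partial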

Quick Start & Requirements

  • Install: pip install -e . from the cloned repository, or docker pull ghcr.io/zhihu/zhilight/zhilight:0.4.8-cu124.
  • Prerequisites: CUDA runtime, cuBLAS, NCCL.
  • Usage: Start an OpenAI-compatible server via python -m zhilight.server.openai.entrypoints.api_server [options]; a client sketch follows this list.
  • Docs: Roadmap
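
Once the server is running, any OpenAI-compatible client can talk to it. Below is a minimal sketch using the official openai Python package; the port and model id are illustrative assumptions, substitute whatever your server was launched with:

    # Minimal client sketch against ZhiLight's OpenAI-compatible server.
    # The base_url port and model id are assumptions, not documented defaults.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")

    resp = client.chat.completions.create(
        model="Llama-2-7b-chat",  # hypothetical id; match your deployment
        messages=[{"role": "user", "content": "Hello!"}],
        max_tokens=64,
    )
    print(resp.choices[0].message.content)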

Highlighted Details

  • Claims significant performance advantages over vLLM and SGLang on various NVIDIA GPUs (AD102, A800, H20) for models from 2B to 110B parameters.
  • Supports a wide range of quantization methods including Native INT8, SmoothQuant, FP8, AWQ, GPTQ, and Marlin kernels (the basic INT8 arithmetic is sketched after this list).
  • Features include dynamic batching, flash attention prefill, chunked prefill, prefix caching, and support for Mixture of Experts (MoE) models like DeepseekV2.
  • Compatible with Llama, Llama2, Mixtral, Qwen2 series, and DeepSeek-V3/R1 models.
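
To make the quantization bullet concrete, here is the basic arithmetic behind symmetric per-channel INT8 weight quantization, common to "native INT8" style schemes. This is a generic sketch, not ZhiLight's kernels:

    # Generic per-channel symmetric INT8 quantization sketch
    # (illustrates the idea, not ZhiLight's actual kernels).
    import torch

    def quantize_int8(w: torch.Tensor):
        # One scale per output channel bounds the rounding error per row.
        scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
        q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
        return q, scale

    def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
        return q.float() * scale

    w = torch.randn(4096, 11008)               # e.g. an MLP projection
    q, s = quantize_int8(w)
    print((dequantize(q, s) - w).abs().max())  # worst-case error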

Maintenance & Community

  • Notable contributors include @a710128, @spetrel, @unix1986, @gnap.
  • No explicit community links (Discord/Slack) are provided in the README.

Licensing & Compatibility

  • License: Apache License 2.0.
  • Compatibility: Permissive license suitable for commercial use and integration with closed-source applications.

Limitations & Caveats

The provided benchmarks show mixed results, with ZhiLight sometimes exhibiting lower QPS or higher latency compared to vLLM or SGLang on specific configurations (e.g., Qwen2-72B-Instruct-GPTQ-Int4 on AD102 PCIe). The README does not detail specific hardware requirements beyond CUDA, nor does it provide explicit guidance on setup time or resource footprint.

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 3
  • Issues (30d): 0
  • Star History: 24 stars in the last 90 days

Explore Similar Projects

nunchaku by nunchaku-tech

  • 2.3% · 3k stars · created 9 months ago · updated 7 hours ago
  • High-performance 4-bit diffusion model inference engine
  • Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jaret Burkett (Founder of Ostris), and 1 more.

fastllm by ztxz16

  • 0.4% · 4k stars · created 2 years ago · updated 2 weeks ago
  • High-performance C++ LLM inference library
  • Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Ying Sheng (Author of SGLang).

ktransformers by kvcache-ai

  • 0.4% · 15k stars · created 1 year ago · updated 20 hours ago
  • Framework for LLM inference optimization experimentation
  • Starred by Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), Michael Han (Cofounder of Unsloth), and 1 more.

TensorRT-LLM by NVIDIA

  • 0.6% · 11k stars · created 1 year ago · updated 8 hours ago
  • LLM inference optimization SDK for NVIDIA GPUs
  • Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 5 more.