ZhiLight by zhihu

LLM inference engine for Llama and variants, optimized for PCIe GPUs

Created 9 months ago
899 stars

Top 40.4% on SourcePulse

Project Summary

ZhiLight is an LLM inference acceleration engine designed to significantly boost inference performance for Llama and its variants, particularly on PCIe-based GPUs. It targets researchers and developers seeking higher throughput and lower latency compared to existing solutions like vLLM, offering an OpenAI-compatible API for ease of integration.

How It Works

ZhiLight builds on a custom tensor implementation and unified global memory management, which enable optimizations such as overlapping encode with all-reduce ("dual streams") and INT8-quantized all-reduce to cut communication overhead. It ships fused kernels for the QKV projection, residual connections, and layernorm, plus fused batch attention for decoding that leverages tensor core instructions. The engine supports Tensor Parallelism (TP) and Pipeline Parallelism (PP) on a single node, with TP recommended.
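
To make the INT8-quantized all-reduce concrete, here is a minimal sketch of the general technique (quantize locally, exchange INT8 payloads plus a scale, dequantize and sum). It simulates the collective with plain numpy and is not ZhiLight's actual kernel; the function names and the symmetric per-tensor scheme are illustrative assumptions.

    # Conceptual sketch, NOT ZhiLight's kernels: an INT8-quantized all-reduce
    # exchanges ~4x less data than FP32 by sending INT8 payloads plus one
    # FP32 scale per rank, at the cost of a small quantization error.
    import numpy as np

    def int8_quantize(x: np.ndarray):
        """Symmetric per-tensor INT8 quantization with a single FP32 scale."""
        scale = max(np.abs(x).max() / 127.0, 1e-8)
        q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
        return q, scale

    def int8_allreduce(per_rank_tensors):
        """Simulate the collective: each rank quantizes its tensor, then the
        reduction dequantizes every payload and sums them; all ranks would
        receive the same reduced result."""
        payloads = [int8_quantize(t) for t in per_rank_tensors]
        return sum(q.astype(np.float32) * s for q, s in payloads)

    ranks = [np.random.randn(4096).astype(np.float32) for _ in range(4)]
    approx = int8_allreduce(ranks)
    exact = np.sum(ranks, axis=0)
    print("max abs error:", np.abs(approx - exact).max())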

Quick Start & Requirements

  • Install: pip install -e . from the cloned repository, or docker pull ghcr.io/zhihu/zhilight/zhilight:0.4.8-cu124.
  • Prerequisites: CUDA runtime, cuBLAS, NCCL.
  • Usage: Start an OpenAI-compatible server via python -m zhilight.server.openai.entrypoints.api_server [options]; a client sketch follows this list.
  • Docs: Roadmap (no URL provided in the README).
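
Because the server speaks the OpenAI protocol, it can be queried with the standard OpenAI Python SDK. A minimal sketch follows; the host/port, API key handling, and served model name are assumptions, so substitute whatever options you start api_server with.

    # Hedged example: talking to a locally running ZhiLight server via the
    # official openai Python package (pip install openai). The port and
    # model name below are assumptions, not documented defaults.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:8000/v1",  # assumed host/port
        api_key="EMPTY",                      # local servers usually ignore the key
    )

    resp = client.chat.completions.create(
        model="Llama-2-7b-chat",              # hypothetical served model name
        messages=[{"role": "user",
                   "content": "Explain tensor parallelism in one sentence."}],
        max_tokens=128,
    )
    print(resp.choices[0].message.content)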

Highlighted Details

  • Claims significant performance advantages over vLLM and SGLang on various NVIDIA GPUs (AD102, A800, H20) for models from 2B to 110B parameters.
  • Supports a wide range of quantization methods including Native INT8, SmoothQuant, FP8, AWQ, GPTQ, and Marlin kernels (see the SmoothQuant sketch after this list).
  • Features include dynamic batching, flash attention prefill, chunked prefill, prefix caching, and support for Mixture of Experts (MoE) models like DeepseekV2.
  • Compatible with Llama, Llama2, Mixtral, Qwen2 series, and DeepSeek-V3/R1 models.
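
Of the listed quantization methods, SmoothQuant is easy to illustrate: a per-input-channel scale migrates quantization difficulty from activations to weights while leaving the matmul result unchanged. The sketch below shows only that scale migration in numpy, under assumed shapes; it is a conceptual illustration, not ZhiLight's implementation.

    # Conceptual SmoothQuant scale migration (illustration, not ZhiLight code):
    # s_j = max|X[:, j]|^alpha / max|W[j, :]|^(1 - alpha) per input channel j,
    # then X/s and W*s have an identical product but flatter activation ranges.
    import numpy as np

    def smooth_scales(X, W, alpha=0.5):
        act_max = np.abs(X).max(axis=0)   # per-channel activation range
        w_max = np.abs(W).max(axis=1)     # per-channel weight range
        return (act_max ** alpha) / (w_max ** (1 - alpha))

    X = np.random.randn(16, 64)           # activations: (tokens, channels)
    W = np.random.randn(64, 128)          # weights: (channels, out_features)

    s = smooth_scales(X, W)
    X_smooth = X / s                      # activations become easier to quantize
    W_smooth = W * s[:, None]             # weights absorb the difficulty

    assert np.allclose(X @ W, X_smooth @ W_smooth)  # matmul result preserved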

Maintenance & Community

  • Notable contributors include @a710128, @spetrel, @unix1986, @gnap.
  • No explicit community links (Discord/Slack) or roadmap URL provided in the README.

Licensing & Compatibility

  • License: Apache License 2.0.
  • Compatibility: Permissive license suitable for commercial use and integration with closed-source applications.

Limitations & Caveats

The provided benchmarks show mixed results, with ZhiLight sometimes exhibiting lower QPS or higher latency compared to vLLM or SGLang on specific configurations (e.g., Qwen2-72B-Instruct-GPTQ-Int4 on AD102 PCIe). The README does not detail specific hardware requirements beyond CUDA, nor does it provide explicit guidance on setup time or resource footprint.

Health Check

  • Last Commit: 2 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 1 star in the last 30 days

Starred by Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), Omar Sanseviero (DevRel at Google DeepMind), and 11 more.

Explore Similar Projects

mistral.rs by EricLBuehler

Top 0.3% on SourcePulse · 6k stars
LLM inference engine for blazing fast performance
Created 1 year ago · Updated 21 hours ago
Starred by Wing Lian (Founder of Axolotl AI) and Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems").

airllm by lyogavin

Top 0.1% on SourcePulse · 6k stars
Inference optimization for LLMs on low-resource hardware
Created 2 years ago · Updated 2 weeks ago