ZhiLight by zhihu

LLM inference engine for Llama and variants, optimized for PCIe GPUs

Created 9 months ago
899 stars

Top 40.4% on SourcePulse

Project Summary

ZhiLight is an LLM inference acceleration engine designed to significantly boost inference performance for Llama and its variants, particularly on PCIe-based GPUs. It targets researchers and developers seeking higher throughput and lower latency compared to existing solutions like vLLM, offering an OpenAI-compatible API for ease of integration.

How It Works

ZhiLight builds on a custom tensor implementation and unified global memory management, which enable optimizations such as overlapping encode with all-reduce ("dual streams") and INT8-quantized all-reduce to cut communication overhead. It ships fused kernels for the QKV projection, residual connections, and layernorm, plus fused batch attention for decoding that leverages tensor core instructions. The engine supports Tensor Parallelism (TP) and Pipeline Parallelism (PP) on a single node, with TP recommended.
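
To make the INT8-quantized all-reduce concrete, here is a minimal sketch of the general technique (quantize locally, exchange INT8 payloads plus a scale, dequantize and sum). It simulates the collective with plain numpy and is not ZhiLight's actual kernel; the function names and the symmetric per-tensor scheme are illustrative assumptions.

    # Conceptual sketch, NOT ZhiLight's kernels: an INT8-quantized all-reduce
    # exchanges ~4x less data than FP32 by sending INT8 payloads plus one
    # FP32 scale per rank, at the cost of a small quantization error.
    import numpy as np

    def int8_quantize(x: np.ndarray):
        """Symmetric per-tensor INT8 quantization with a single FP32 scale."""
        scale = max(np.abs(x).max() / 127.0, 1e-8)
        q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
        return q, scale

    def int8_allreduce(per_rank_tensors):
        """Simulate the collective: each rank quantizes its tensor, then the
        reduction dequantizes every payload and sums them; all ranks would
        receive the same reduced result."""
        payloads = [int8_quantize(t) for t in per_rank_tensors]
        return sum(q.astype(np.float32) * s for q, s in payloads)

    ranks = [np.random.randn(4096).astype(np.float32) for _ in range(4)]
    approx = int8_allreduce(ranks)
    exact = np.sum(ranks, axis=0)
    print("max abs error:", np.abs(approx - exact).max())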

Quick Start & Requirements

  • Install: pip install -e . from the cloned repository, or docker pull ghcr.io/zhihu/zhilight/zhilight:0.4.8-cu124.
  • Prerequisites: CUDA runtime, cuBLAS, NCCL.
  • Usage: Start an OpenAI-compatible server via python -m zhilight.server.openai.entrypoints.api_server [options]; a client sketch follows this list.
  • Docs: Roadmap (no URL provided in the README).
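
Because the server speaks the OpenAI protocol, it can be queried with the standard OpenAI Python SDK. A minimal sketch follows; the host/port, API key handling, and served model name are assumptions, so substitute whatever options you start api_server with.

    # Hedged example: talking to a locally running ZhiLight server via the
    # official openai Python package (pip install openai). The port and
    # model name below are assumptions, not documented defaults.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:8000/v1",  # assumed host/port
        api_key="EMPTY",                      # local servers usually ignore the key
    )

    resp = client.chat.completions.create(
        model="Llama-2-7b-chat",              # hypothetical served model name
        messages=[{"role": "user",
                   "content": "Explain tensor parallelism in one sentence."}],
        max_tokens=128,
    )
    print(resp.choices[0].message.content)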

Highlighted Details

  • Claims significant performance advantages over vLLM and SGLang on various NVIDIA GPUs (AD102, A800, H20) for models from 2B to 110B parameters.
  • Supports a wide range of quantization methods including Native INT8, SmoothQuant, FP8, AWQ, GPTQ, and Marlin kernels (see the SmoothQuant sketch after this list).
  • Features include dynamic batching, flash attention prefill, chunked prefill, prefix caching, and support for Mixture of Experts (MoE) models like DeepseekV2.
  • Compatible with Llama, Llama2, Mixtral, Qwen2 series, and DeepSeek-V3/R1 models.
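
Of the listed quantization methods, SmoothQuant is easy to illustrate: a per-input-channel scale migrates quantization difficulty from activations to weights while leaving the matmul result unchanged. The sketch below shows only that scale migration in numpy, under assumed shapes; it is a conceptual illustration, not ZhiLight's implementation.

    # Conceptual SmoothQuant scale migration (illustration, not ZhiLight code):
    # s_j = max|X[:, j]|^alpha / max|W[j, :]|^(1 - alpha) per input channel j,
    # then X/s and W*s have an identical product but flatter activation ranges.
    import numpy as np

    def smooth_scales(X, W, alpha=0.5):
        act_max = np.abs(X).max(axis=0)   # per-channel activation range
        w_max = np.abs(W).max(axis=1)     # per-channel weight range
        return (act_max ** alpha) / (w_max ** (1 - alpha))

    X = np.random.randn(16, 64)           # activations: (tokens, channels)
    W = np.random.randn(64, 128)          # weights: (channels, out_features)

    s = smooth_scales(X, W)
    X_smooth = X / s                      # activations become easier to quantize
    W_smooth = W * s[:, None]             # weights absorb the difficulty

    assert np.allclose(X @ W, X_smooth @ W_smooth)  # matmul result preserved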

Maintenance & Community

  • Notable contributors include @a710128, @spetrel, @unix1986, @gnap.
  • No explicit community links (Discord/Slack) or roadmap URL provided in the README.

Licensing & Compatibility

  • License: Apache License 2.0.
  • Compatibility: Permissive license suitable for commercial use and integration with closed-source applications.

Limitations & Caveats

The provided benchmarks show mixed results, with ZhiLight sometimes exhibiting lower QPS or higher latency compared to vLLM or SGLang on specific configurations (e.g., Qwen2-72B-Instruct-GPTQ-Int4 on AD102 PCIe). The README does not detail specific hardware requirements beyond CUDA, nor does it provide explicit guidance on setup time or resource footprint.

Health Check

  • Last Commit: 2 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 1 star in the last 30 days

Starred by Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), Omar Sanseviero (DevRel at Google DeepMind), and 11 more.

Explore Similar Projects

mistral.rs by EricLBuehler

Top 0.3% on SourcePulse · 6k stars
LLM inference engine for blazing fast performance
Created 1 year ago · Updated 21 hours ago
Starred by Wing Lian (Founder of Axolotl AI) and Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems").

airllm by lyogavin

Top 0.1% on SourcePulse · 6k stars
Inference optimization for LLMs on low-resource hardware
Created 2 years ago · Updated 2 weeks ago