Unified inference engine for large-scale LLM serving
OmniServe is a unified LLM inference engine designed to optimize both low-bit quantization and long-context processing for large-scale serving. It integrates innovations from QServe (W4A8KV4 quantization) and LServe (unified sparse attention), and it targets researchers and engineers deploying LLMs who need to significantly reduce serving costs while improving throughput.
How It Works
OmniServe combines QServe's W4A8KV4 quantization, which minimizes dequantization overheads through progressive quantization and compute-aware optimizations, with LServe's hybrid sparse attention. LServe employs hardware-friendly structured sparsity patterns and a hierarchical KV cache pruning policy to skip computations on less important tokens, accelerating both prefill and decoding stages. This dual approach addresses computational complexity and memory bottlenecks for efficient LLM serving.
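To make the "progressive" idea concrete, here is a minimal PyTorch sketch of two-level weight quantization in the W4A8 spirit: weights are first quantized to per-channel int8, then the int8 codes are re-quantized to per-group 4-bit, so runtime dequantization only needs a cheap int4-to-int8 step before the floating-point epilogue. The function names, group size, and rounding choices are illustrative assumptions, not OmniServe's actual kernel layout.

```python
import torch

def progressive_quantize(w: torch.Tensor, group_size: int = 128):
    """Two-level ("progressive") weight quantization sketch:
    fp -> per-channel int8 -> per-group 4-bit on top of the int8 codes."""
    # Level 1: symmetric per-output-channel int8.
    s8 = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0  # fp scale per channel
    w8 = torch.clamp(torch.round(w / s8), -128, 127)                # int8 codes

    # Level 2: asymmetric per-group uint4 over the int8 codes, so runtime
    # dequantization is a cheap int4 -> int8 step inside the GEMM mainloop.
    g = w8.view(w8.shape[0], -1, group_size)
    z4 = g.amin(dim=2, keepdim=True)                                 # int8-range zero point
    s4 = torch.clamp((g.amax(dim=2, keepdim=True) - z4) / 15.0, min=1e-6)
    w4 = torch.clamp(torch.round((g - z4) / s4), 0, 15)              # uint4 codes
    return w4, s4, z4, s8

def dequantize(w4, s4, z4, s8):
    w8 = w4 * s4 + z4                      # int4 -> int8 (register-level in real kernels)
    return w8.view(w8.shape[0], -1) * s8   # per-channel fp scale applied once at the epilogue

w = torch.randn(4096, 4096)
w4, s4, z4, s8 = progressive_quantize(w)
print((dequantize(w4, s4, z4, s8) - w).abs().mean())  # small reconstruction error
```

The payoff of the two-level scheme is that the expensive floating-point rescale happens once per output channel at the epilogue, rather than per weight inside the GEMM inner loop.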
Quick Start & Requirements
Create a conda environment (`conda create -n OmniServe python=3.10`), activate it, install PyTorch with a matching CUDA toolkit, and then install OmniServe (`pip install -e .`). FlashAttention-2 and Block-Sparse-Attention require their own installation steps, potentially involving pre-built wheels or compiling from source. `git-lfs` is required for the model zoo.

Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
Users may need to tune `NUM_GPU_PAGE_BLOCKS` for optimal memory utilization.
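As a rough starting point for that tuning, the sketch below estimates how many KV-cache page blocks fit in free GPU memory. The page granularity, model dimensions, head-room fraction, and the assumption that `NUM_GPU_PAGE_BLOCKS` counts fixed-size pages are all illustrative placeholders; check the repository's serving scripts for the real semantics.

```python
import torch

def estimate_num_gpu_page_blocks(
    tokens_per_page: int = 64,      # assumed page granularity
    num_layers: int = 32,           # e.g. a Llama-8B-like model (assumed)
    num_kv_heads: int = 8,
    head_dim: int = 128,
    kv_bits: int = 4,               # W4A8KV4 stores the KV cache in 4 bits
    reserve_fraction: float = 0.3,  # head-room for weights and activations
) -> int:
    """Back-of-the-envelope page-block count from free GPU memory.

    bytes per page = 2 (K and V) * layers * kv_heads * head_dim
                     * tokens_per_page * kv_bits / 8
    """
    free_bytes, _total = torch.cuda.mem_get_info()
    usable = free_bytes * (1.0 - reserve_fraction)
    bytes_per_page = 2 * num_layers * num_kv_heads * head_dim \
        * tokens_per_page * kv_bits // 8
    return int(usable // bytes_per_page)

# Example: use the estimate as a candidate NUM_GPU_PAGE_BLOCKS value.
print(estimate_num_gpu_page_blocks())
```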