omniserve by mit-han-lab

Unified inference engine for large-scale LLM serving

created 1 year ago
730 stars

Top 48.4% on sourcepulse

Project Summary

OmniServe is a unified LLM inference engine designed to optimize both low-bit quantization and long-context processing for large-scale serving. It integrates innovations from QServe (W4A8KV4 quantization) and LServe (unified sparse attention) to significantly reduce serving costs and improve throughput, and targets researchers and engineers deploying LLMs.

How It Works

OmniServe combines QServe's W4A8KV4 quantization, which minimizes dequantization overheads through progressive quantization and compute-aware optimizations, with LServe's hybrid sparse attention. LServe employs hardware-friendly structured sparsity patterns and a hierarchical KV cache pruning policy to skip computations on less important tokens, accelerating both prefill and decoding stages. This dual approach addresses computational complexity and memory bottlenecks for efficient LLM serving.
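
As a rough illustration of the progressive quantization idea, the sketch below quantizes weights per-channel to INT8 and then per-group to INT4 on top of that grid, so the only dequantization near the GEMM is a cheap INT4-to-INT8 step while activations stay INT8; the 4-bit KV cache (the "KV4" part) is quantized analogously and omitted here. This is a minimal NumPy sketch of the concept, not OmniServe's fused CUDA kernels; the shapes, group size, and helper names are illustrative.

```python
# Minimal NumPy sketch of the progressive W4A8 idea (illustrative only;
# OmniServe/QServe implement this with fused CUDA kernels and optimized
# rescaling on the critical path).
import numpy as np

def quantize_per_channel(x, n_bits):
    """Symmetric per-row quantization to a signed n_bits integer grid."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(x).max(axis=-1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
W = rng.standard_normal((128, 256)).astype(np.float32)  # weights [out, in]
X = rng.standard_normal((4, 256)).astype(np.float32)    # activations [tokens, in]

# Level 1: per-channel INT8 weights. Level 2: per-group INT4 on top of the
# INT8 grid, so dequantization at GEMM time only has to go INT4 -> INT8.
W8, w_scale = quantize_per_channel(W, 8)
GROUP = 128
W4, group_scales = np.empty_like(W8), []
for g0 in range(0, W8.shape[1], GROUP):
    q4, s4 = quantize_per_channel(W8[:, g0:g0 + GROUP].astype(np.float32), 4)
    W4[:, g0:g0 + GROUP] = q4
    group_scales.append(s4)

# INT8 activations with per-token scales (the "A8" part).
X8, x_scale = quantize_per_channel(X, 8)

# Reconstruct the INT8-level weights from INT4, run the GEMM on integers,
# then rescale the INT32 accumulator back to float.
W8_hat = np.concatenate(
    [np.round(W4[:, g0:g0 + GROUP] * group_scales[i]).astype(np.int32)
     for i, g0 in enumerate(range(0, W8.shape[1], GROUP))], axis=1)
Y = (X8.astype(np.int32) @ W8_hat.T).astype(np.float32) * x_scale * w_scale.T
print(Y.shape)  # (4, 128)
```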

Quick Start & Requirements

  • Installation: Clone the repository, create a conda environment (conda create -n OmniServe python=3.10), activate it, install PyTorch with a matching CUDA toolkit, and then install OmniServe (pip install -e .). FlashAttention-2 and Block-Sparse-Attention require additional installation steps, via pre-built wheels or compilation from source.
  • Prerequisites: Python 3.10, PyTorch (the version must match the FlashAttention wheels; see the environment check after this list), CUDA Toolkit, and git-lfs for the model zoo.
  • Resources: Requires significant GPU memory for LLM serving. Pre-quantized checkpoints are available via Hugging Face.
  • Docs: Website (QServe), Website (LServe)
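
Because pre-built FlashAttention-2 and Block-Sparse-Attention wheels are tied to specific PyTorch and CUDA builds, it can save time to confirm the local toolchain before installing them. The check below is a small, non-authoritative sketch; the compute-capability threshold reflects FlashAttention-2's usual GPU requirements and is not something stated by OmniServe.

```python
# Quick environment sanity check before installing FlashAttention-2 /
# Block-Sparse-Attention wheels (illustrative; adjust the expectations to
# whichever wheel you plan to install).
import torch

print("PyTorch:", torch.__version__)           # must match the wheel's torch tag
print("CUDA (torch build):", torch.version.cuda)
print("GPU available:", torch.cuda.is_available())
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    print(f"Compute capability: sm_{major}{minor}")
    # FlashAttention-2 generally targets Ampere (sm_80) or newer GPUs.
    if major < 8:
        print("Warning: this GPU may not be supported by FlashAttention-2.")
```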

Highlighted Details

  • Achieves 1.2x-1.4x higher throughput than TensorRT-LLM on Llama-3-8B and 2.4x-3.5x on Qwen1.5-72B.
  • Enables A100-level throughput on L40S GPUs, reducing serving costs by up to 3x.
  • Integrates W4A8KV4 quantization and unified sparse attention for optimized long-context and quantized LLM inference.
  • Supports in-flight batching and paged attention.
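
Paged attention stores the KV cache in fixed-size physical blocks and maps each sequence to its blocks through a block table, so memory is allocated on demand rather than reserved for the maximum context length. The sketch below is a conceptual illustration of that bookkeeping, not OmniServe's data structures; the class, method names, and block size are hypothetical.

```python
# Conceptual sketch of a paged KV cache: fixed-size blocks plus a per-sequence
# block table (illustrative only; not OmniServe's implementation).
BLOCK_SIZE = 16  # tokens per KV block (illustrative)

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))   # physical block pool
        self.block_tables = {}                        # seq_id -> [block ids]
        self.seq_lens = {}                            # seq_id -> tokens stored

    def append_token(self, seq_id):
        """Reserve space for one more token of a sequence."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:                  # current block is full
            if not self.free_blocks:
                raise RuntimeError("out of KV cache blocks")
            table.append(self.free_blocks.pop())      # grab a new block
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id):
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8)
for _ in range(40):          # a 40-token sequence needs ceil(40/16) = 3 blocks
    cache.append_token("req-0")
print(cache.block_tables["req-0"])  # three block ids from the pool
```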

Maintenance & Community

  • Maintained by MIT HAN LAB, with contributions from researchers at MIT, SJTU, UC San Diego, and NVIDIA.
  • Related projects include DeepCompressor, AWQ, TinyChat, VILA, SmoothQuant, StreamingLLM, and SpAtten.

Licensing & Compatibility

  • The repository itself does not explicitly state a license in the README. However, related projects like DeepCompressor and AWQ are typically under permissive licenses (e.g., Apache 2.0). Specific model checkpoints may have their own licenses.

Limitations & Caveats

  • Installation of dependencies like FlashAttention-2 and Block-Sparse-Attention can be complex, requiring careful matching of PyTorch and CUDA versions, and potentially manual wheel installation or compilation.
  • The README mentions that the automatic GPU page allocation algorithm is conservative, recommending manual adjustment of NUM_GPU_PAGE_BLOCKS for optimal memory utilization.
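
One hedged way to choose NUM_GPU_PAGE_BLOCKS is to size it against free GPU memory while leaving headroom for activations. The sketch below only shows the arithmetic; the bytes-per-block constant is a placeholder that depends on the model, page size, and KV precision, so this is not OmniServe's sizing rule.

```python
# Hedged sketch for picking NUM_GPU_PAGE_BLOCKS from free GPU memory.
# BYTES_PER_PAGE_BLOCK is a placeholder: the real per-block footprint depends
# on the model's KV heads, page size, and KV4 precision, so measure it first.
import torch

BYTES_PER_PAGE_BLOCK = 2 * 1024 * 1024   # placeholder, NOT a real constant
HEADROOM = 0.9                            # leave ~10% for activations/buffers

free_bytes, _total_bytes = torch.cuda.mem_get_info()
num_blocks = int(free_bytes * HEADROOM // BYTES_PER_PAGE_BLOCK)

# Export the suggested value before launching the serving process.
print(f"export NUM_GPU_PAGE_BLOCKS={num_blocks}")
```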

Health Check

  • Last commit: 5 months ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 75 stars in the last 90 days
