omniserve by mit-han-lab

Unified inference engine for large-scale LLM serving

created 1 year ago
730 stars

Top 48.4% on sourcepulse

Project Summary

OmniServe is a unified LLM inference engine designed to optimize both low-bit quantization and long-context processing for large-scale serving. It integrates innovations from QServe (W4A8KV4 quantization) and LServe (unified sparse attention) to significantly reduce serving costs and improve throughput, and targets researchers and engineers deploying LLMs.

How It Works

OmniServe combines QServe's W4A8KV4 quantization, which minimizes dequantization overheads through progressive quantization and compute-aware optimizations, with LServe's hybrid sparse attention. LServe employs hardware-friendly structured sparsity patterns and a hierarchical KV cache pruning policy to skip computations on less important tokens, accelerating both prefill and decoding stages. This dual approach addresses computational complexity and memory bottlenecks for efficient LLM serving.
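
As a rough illustration of the progressive quantization idea, the sketch below quantizes weights per-channel to INT8 and then per-group to INT4 on top of that grid, so the only dequantization near the GEMM is a cheap INT4-to-INT8 step while activations stay INT8; the 4-bit KV cache (the "KV4" part) is quantized analogously and omitted here. This is a minimal NumPy sketch of the concept, not OmniServe's fused CUDA kernels; the shapes, group size, and helper names are illustrative.

```python
# Minimal NumPy sketch of the progressive W4A8 idea (illustrative only;
# OmniServe/QServe implement this with fused CUDA kernels and optimized
# rescaling on the critical path).
import numpy as np

def quantize_per_channel(x, n_bits):
    """Symmetric per-row quantization to a signed n_bits integer grid."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(x).max(axis=-1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
W = rng.standard_normal((128, 256)).astype(np.float32)  # weights [out, in]
X = rng.standard_normal((4, 256)).astype(np.float32)    # activations [tokens, in]

# Level 1: per-channel INT8 weights. Level 2: per-group INT4 on top of the
# INT8 grid, so dequantization at GEMM time only has to go INT4 -> INT8.
W8, w_scale = quantize_per_channel(W, 8)
GROUP = 128
W4, group_scales = np.empty_like(W8), []
for g0 in range(0, W8.shape[1], GROUP):
    q4, s4 = quantize_per_channel(W8[:, g0:g0 + GROUP].astype(np.float32), 4)
    W4[:, g0:g0 + GROUP] = q4
    group_scales.append(s4)

# INT8 activations with per-token scales (the "A8" part).
X8, x_scale = quantize_per_channel(X, 8)

# Reconstruct the INT8-level weights from INT4, run the GEMM on integers,
# then rescale the INT32 accumulator back to float.
W8_hat = np.concatenate(
    [np.round(W4[:, g0:g0 + GROUP] * group_scales[i]).astype(np.int32)
     for i, g0 in enumerate(range(0, W8.shape[1], GROUP))], axis=1)
Y = (X8.astype(np.int32) @ W8_hat.T).astype(np.float32) * x_scale * w_scale.T
print(Y.shape)  # (4, 128)
```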

Quick Start & Requirements

  • Installation: Clone the repository, create a conda environment (conda create -n OmniServe python=3.10), activate it, install PyTorch with a matching CUDA toolkit, and then install OmniServe (pip install -e .). FlashAttention-2 and Block-Sparse-Attention require additional installation steps, via pre-built wheels or compilation from source.
  • Prerequisites: Python 3.10, PyTorch (the version must match the FlashAttention wheels; see the environment check after this list), CUDA Toolkit, and git-lfs for the model zoo.
  • Resources: Requires significant GPU memory for LLM serving. Pre-quantized checkpoints are available via Hugging Face.
  • Docs: Website (QServe), Website (LServe)
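
Because pre-built FlashAttention-2 and Block-Sparse-Attention wheels are tied to specific PyTorch and CUDA builds, it can save time to confirm the local toolchain before installing them. The check below is a small, non-authoritative sketch; the compute-capability threshold reflects FlashAttention-2's usual GPU requirements and is not something stated by OmniServe.

```python
# Quick environment sanity check before installing FlashAttention-2 /
# Block-Sparse-Attention wheels (illustrative; adjust the expectations to
# whichever wheel you plan to install).
import torch

print("PyTorch:", torch.__version__)           # must match the wheel's torch tag
print("CUDA (torch build):", torch.version.cuda)
print("GPU available:", torch.cuda.is_available())
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    print(f"Compute capability: sm_{major}{minor}")
    # FlashAttention-2 generally targets Ampere (sm_80) or newer GPUs.
    if major < 8:
        print("Warning: this GPU may not be supported by FlashAttention-2.")
```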

Highlighted Details

  • Achieves 1.2x-1.4x higher throughput than TensorRT-LLM on Llama-3-8B and 2.4x-3.5x on Qwen1.5-72B.
  • Enables A100-level throughput on L40S GPUs, reducing serving costs by up to 3x.
  • Integrates W4A8KV4 quantization and unified sparse attention for optimized long-context and quantized LLM inference.
  • Supports in-flight batching and paged attention.
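
Paged attention stores the KV cache in fixed-size physical blocks and maps each sequence to its blocks through a block table, so memory is allocated on demand rather than reserved for the maximum context length. The sketch below is a conceptual illustration of that bookkeeping, not OmniServe's data structures; the class, method names, and block size are hypothetical.

```python
# Conceptual sketch of a paged KV cache: fixed-size blocks plus a per-sequence
# block table (illustrative only; not OmniServe's implementation).
BLOCK_SIZE = 16  # tokens per KV block (illustrative)

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))   # physical block pool
        self.block_tables = {}                        # seq_id -> [block ids]
        self.seq_lens = {}                            # seq_id -> tokens stored

    def append_token(self, seq_id):
        """Reserve space for one more token of a sequence."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:                  # current block is full
            if not self.free_blocks:
                raise RuntimeError("out of KV cache blocks")
            table.append(self.free_blocks.pop())      # grab a new block
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id):
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8)
for _ in range(40):          # a 40-token sequence needs ceil(40/16) = 3 blocks
    cache.append_token("req-0")
print(cache.block_tables["req-0"])  # three block ids from the pool
```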

Maintenance & Community

  • Maintained by MIT HAN LAB, with contributions from researchers at MIT, SJTU, UC San Diego, and NVIDIA.
  • Related projects include DeepCompressor, AWQ, TinyChat, VILA, SmoothQuant, StreamingLLM, and SpAtten.

Licensing & Compatibility

  • The repository itself does not explicitly state a license in the README. However, related projects like DeepCompressor and AWQ are typically under permissive licenses (e.g., Apache 2.0). Specific model checkpoints may have their own licenses.

Limitations & Caveats

  • Installation of dependencies like FlashAttention-2 and Block-Sparse-Attention can be complex, requiring careful matching of PyTorch and CUDA versions, and potentially manual wheel installation or compilation.
  • The README mentions that the automatic GPU page allocation algorithm is conservative, recommending manual adjustment of NUM_GPU_PAGE_BLOCKS for optimal memory utilization.
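
One hedged way to choose NUM_GPU_PAGE_BLOCKS is to size it against free GPU memory while leaving headroom for activations. The sketch below only shows the arithmetic; the bytes-per-block constant is a placeholder that depends on the model, page size, and KV precision, so this is not OmniServe's sizing rule.

```python
# Hedged sketch for picking NUM_GPU_PAGE_BLOCKS from free GPU memory.
# BYTES_PER_PAGE_BLOCK is a placeholder: the real per-block footprint depends
# on the model's KV heads, page size, and KV4 precision, so measure it first.
import torch

BYTES_PER_PAGE_BLOCK = 2 * 1024 * 1024   # placeholder, NOT a real constant
HEADROOM = 0.9                            # leave ~10% for activations/buffers

free_bytes, _total_bytes = torch.cuda.mem_get_info()
num_blocks = int(free_bytes * HEADROOM // BYTES_PER_PAGE_BLOCK)

# Export the suggested value before launching the serving process.
print(f"export NUM_GPU_PAGE_BLOCKS={num_blocks}")
```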

Health Check

  • Last commit: 5 months ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 75 stars in the last 90 days
