punica by punica-ai

LoRA serving system (research paper) for multi-tenant LLM inference

Created 2 years ago

1,131 stars

Top 33.9% on SourcePulse

View on GitHub

8 Experts Love This Project

Wing Lian

Founder of Axolotl AI

Jeff Hammerbacher

Cofounder of Cloudera

Chip Huyen

Author of "AI Engineering", "Designing Machine Learning Systems"

Woosuk Kwon

Coauthor of vLLM

and 4 more!

Project Summary

Punica addresses the challenge of efficiently serving multiple LoRA-finetuned Large Language Models (LLMs) simultaneously. It targets users who need to deploy diverse LLM specializations without the prohibitive cost of running each independently, offering significant throughput gains by consolidating them.

How It Works

Punica leverages the additive nature of LoRA weights (W + A@B) by mathematically combining multiple LoRA adapters. For a batch of inputs, each directed to a specific LoRA model, Punica computes the base LLM output once for the entire batch. The LoRA-specific additions are then computed efficiently using a custom CUDA kernel called Segmented Gather Matrix-Vector multiplication (SGMV), which preserves the strong batching benefits of the base model.

Quick Start & Requirements

Installation: pip install punica -i https://punica-ai.github.io/whl/cu121/ --extra-index-url https://pypi.org/simple (adjust cu121 for your CUDA version). Building from source requires ninja, numpy, and torch.
Prerequisites: CUDA (pre-built wheels support 11.8, 12.1), Python 3.10, 3.11. Building from source may require setting TORCH_CUDA_ARCH_LIST.
Resources: Requires a CUDA-enabled GPU.
Links: Demo, Paper

Highlighted Details

Achieves up to 12x higher text generation throughput compared to state-of-the-art systems like vLLM, FasterTransformer, DeepSpeed, and HuggingFace Transformers.
Utilizes a custom SGMV CUDA kernel for efficient LoRA computation.
Preserves the strong batching effect of the base LLM.
Supports serving multiple LoRA models concurrently with minimal overhead.

Maintenance & Community

The project is associated with researchers from the University of Washington.
A paper is available for in-depth understanding.

Licensing & Compatibility

The repository does not explicitly state a license in the README. This requires further investigation for commercial or closed-source use.

Limitations & Caveats

Pre-built wheels are tied to specific CUDA and Python versions, potentially requiring source compilation for compatibility.
The lack of a clearly stated license is a significant caveat for adoption.

Health Check

Last Commit

1 year ago

Responsiveness

Inactive

Pull Requests (30d)

Issues (30d)

Star History

7 stars in the last 30 days