LoRA serving system (research paper) for multi-tenant LLM inference
Punica addresses the challenge of efficiently serving many LoRA-finetuned Large Language Models (LLMs) simultaneously. It targets users who need to deploy diverse LLM specializations without the prohibitive cost of running a separate deployment per model, and it delivers significant throughput gains by consolidating them onto shared GPUs.
How It Works
Punica leverages the additive structure of LoRA weights (W + A@B) to serve many adapters in a single batch. For a batch of inputs, each directed to a specific LoRA model, Punica computes the base LLM projection once for the entire batch. The small LoRA-specific deltas are then applied by a custom CUDA kernel, Segmented Gather Matrix-Vector multiplication (SGMV), which gathers each request's adapter weights while preserving the batching efficiency of the shared base model.
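As a concrete illustration, here is a minimal PyTorch sketch of the computation SGMV performs. It mimics the semantics only (Punica fuses this into one CUDA kernel), and all names here (sgmv_reference, lora_A, lora_B, adapter_ids) are illustrative rather than Punica's actual API:

    import torch

    def sgmv_reference(x, w_base, lora_A, lora_B, adapter_ids):
        """Plain-PyTorch sketch of SGMV semantics (illustrative, not Punica's kernel).

        x:           (batch, h1) one input row per request
        w_base:      (h1, h2) shared base-model weight W
        lora_A:      (num_adapters, h1, r) per-adapter down-projections A
        lora_B:      (num_adapters, r, h2) per-adapter up-projections B
        adapter_ids: (batch,) which adapter each request uses
        """
        # Base projection: one dense matmul for the whole batch,
        # so it keeps full batching efficiency regardless of adapters.
        y = x @ w_base

        # LoRA delta: gather each request's adapter, then apply x @ A @ B.
        # Punica fuses this gather plus the two skinny matmuls into SGMV.
        A = lora_A[adapter_ids]                              # (batch, h1, r)
        B = lora_B[adapter_ids]                              # (batch, r, h2)
        delta = torch.bmm(torch.bmm(x.unsqueeze(1), A), B)  # (batch, 1, h2)
        return y + delta.squeeze(1)

    # Toy usage: 4 requests spread across 2 different LoRA adapters.
    h1, h2, r = 16, 16, 4
    x = torch.randn(4, h1)
    out = sgmv_reference(x, torch.randn(h1, h2),
                         torch.randn(2, h1, r), torch.randn(2, r, h2),
                         torch.tensor([0, 1, 1, 0]))
    print(out.shape)  # torch.Size([4, 16])

In the actual kernel, requests using the same adapter are grouped into contiguous segments (the "segmented" in SGMV) rather than gathered row by row as above.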
Quick Start & Requirements
    pip install punica -i https://punica-ai.github.io/whl/cu121/ --extra-index-url https://pypi.org/simple

(Adjust cu121 in the index URL to match your CUDA version.) Building from source requires ninja, numpy, and torch; set TORCH_CUDA_ARCH_LIST to select the GPU architectures to compile for.
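If you are unsure which cuXXX tag applies, one quick check (plain PyTorch, nothing Punica-specific) is the CUDA version your installed torch was built against:

    import torch

    # Prints the CUDA toolkit version torch was built with, e.g. "12.1" -> cu121.
    # Prints None for a CPU-only build, which Punica's CUDA kernels cannot use.
    print(torch.version.cuda)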
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats