punica by punica-ai

LoRA serving system for multi-tenant LLM inference, with an accompanying research paper

Created 2 years ago
1,087 stars

Top 35.0% on SourcePulse

Project Summary

Punica addresses the challenge of efficiently serving multiple LoRA-finetuned Large Language Models (LLMs) simultaneously. It targets users who need to deploy diverse LLM specializations without the prohibitive cost of running each independently, offering significant throughput gains by consolidating them.

How It Works

Punica exploits the additive structure of LoRA weights: a LoRA layer computes x @ (W + A@B) = x@W + (x@A)@B, so the expensive base projection x@W can be shared across requests while each low-rank delta stays adapter-specific. For a batch of inputs, each directed to a different LoRA model, Punica computes the base LLM output once for the entire batch; the per-adapter additions are then applied by a custom CUDA kernel, Segmented Gather Matrix-Vector multiplication (SGMV), which preserves the strong batching benefits of the base model.
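
To make the semantics concrete, here is a minimal pure-PyTorch sketch of what SGMV computes (illustrative only: the names and shapes are invented here, and the Python loop stands in for Punica's fused CUDA kernel):

    import torch

    def sgmv_reference(x, As, Bs, idx):
        # x:   (batch, d_in) activations, one row per request
        # As:  per-adapter LoRA A matrices, each (d_in, rank)
        # Bs:  per-adapter LoRA B matrices, each (rank, d_out)
        # idx: (batch,) adapter index chosen by each request
        out = torch.zeros(x.size(0), Bs[0].size(1), dtype=x.dtype)
        for i in range(x.size(0)):      # Punica's fused GPU kernel replaces this loop
            A, B = As[idx[i]], Bs[idx[i]]
            out[i] = (x[i] @ A) @ B     # low-rank delta for this request's adapter
        return out

    batch, d_in, d_out, rank, n_adapters = 8, 64, 64, 4, 3
    W = torch.randn(d_in, d_out)        # frozen base weight, shared by all requests
    As = [torch.randn(d_in, rank) for _ in range(n_adapters)]
    Bs = [torch.randn(rank, d_out) for _ in range(n_adapters)]
    x = torch.randn(batch, d_in)
    idx = torch.randint(n_adapters, (batch,))

    # One batched GEMM for the base model plus per-request low-rank deltas:
    # y_i = x_i @ (W + A_k @ B_k) for request i routed to adapter k.
    y = x @ W + sgmv_reference(x, As, Bs, idx)

The point of the custom kernel is that the gather and the two small matmuls run batched on the GPU rather than one request at a time, so serving many adapters adds little cost on top of the shared x @ W.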

Quick Start & Requirements

  • Installation: pip install punica -i https://punica-ai.github.io/whl/cu121/ --extra-index-url https://pypi.org/simple (swap cu121 for your CUDA version). Building from source requires ninja, numpy, and torch; a quick post-install sanity check is sketched just after this list.
  • Prerequisites: CUDA (pre-built wheels cover 11.8 and 12.1) and Python 3.10 or 3.11. Building from source may require setting TORCH_CUDA_ARCH_LIST for your GPU architecture.
  • Resources: Requires a CUDA-enabled GPU.
  • Links: Demo, Paper
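
After installing a wheel, a quick sanity check can confirm the environment (a sketch, assuming the package imports as punica and a CUDA GPU is visible; the exact submodule layout may differ by version):

    import torch
    import punica  # package name from the pip install above; submodule layout may vary

    # Punica's SGMV kernels need a CUDA-capable GPU at runtime.
    assert torch.cuda.is_available(), "no CUDA device visible"
    print("torch", torch.__version__, "| CUDA", torch.version.cuda)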

Highlighted Details

  • Achieves up to 12x higher text generation throughput compared to state-of-the-art systems like vLLM, FasterTransformer, DeepSpeed, and HuggingFace Transformers.
  • Utilizes a custom SGMV CUDA kernel for efficient LoRA computation.
  • Preserves the strong batching effect of the base LLM.
  • Supports serving multiple LoRA models concurrently with minimal overhead.

Maintenance & Community

  • The project is associated with researchers from the University of Washington.
  • A research paper is available for a deeper understanding of the system.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README. This requires further investigation for commercial or closed-source use.

Limitations & Caveats

  • Pre-built wheels are tied to specific CUDA and Python versions, potentially requiring source compilation for compatibility.
  • The lack of a clearly stated license is a significant caveat for adoption.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 5 stars in the last 30 days

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Johannes Hagemann (Cofounder of Prime Intellect), and 4 more.

Explore Similar Projects

S-LoRA by S-LoRA · 0.2% · 2k stars
System for scalable LoRA adapter serving
Created 1 year ago · Updated 1 year ago
Starred by Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 8 more.

lorax by predibase · 0.2% · 3k stars
Multi-LoRA inference server for serving 1000s of fine-tuned LLMs
Created 1 year ago · Updated 4 months ago
Starred by Luis Capelo (Cofounder of Lightning AI), Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), and 4 more.

ktransformers by kvcache-ai · 0.3% · 15k stars
Framework for LLM inference optimization experimentation
Created 1 year ago · Updated 2 days ago