punica by punica-ai

LoRA serving system (research paper) for multi-tenant LLM inference

created 1 year ago
1,078 stars

Top 35.8% on sourcepulse

View on GitHub
Project Summary

Punica addresses the challenge of efficiently serving multiple LoRA-finetuned Large Language Models (LLMs) simultaneously. It targets users who need to deploy diverse LLM specializations without the prohibitive cost of running each independently, offering significant throughput gains by consolidating them.

How It Works

Punica exploits the additive structure of LoRA weights (W' = W + A@B). For a batch of requests, each directed to a specific LoRA adapter, Punica computes the base LLM output once for the entire batch; the per-adapter low-rank additions are then applied by a custom CUDA kernel, Segmented Gather Matrix-Vector multiplication (SGMV), which preserves the strong batching benefits of the base model.
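
The SGMV semantics can be sketched in plain PyTorch as a reference loop over batch segments. This is not Punica's fused CUDA kernel or its API; all names and shapes below are illustrative assumptions.

```python
import torch

def sgmv_reference(x, lora_A, lora_B, segments):
    """Apply a different LoRA adapter (A_i @ B_i) to each segment of the batch.

    x:        (total_tokens, hidden)        inputs for the whole batch
    lora_A:   list of (hidden, rank)        per-adapter A matrices
    lora_B:   list of (rank, hidden)        per-adapter B matrices
    segments: list of (start, end, adapter) mapping batch rows to adapter index
    """
    y = torch.zeros_like(x)
    for start, end, idx in segments:
        # Each segment gathers its own low-rank pair. The base projection
        # (x @ W) is computed once for the whole batch elsewhere and added
        # to this delta.
        y[start:end] = x[start:end] @ lora_A[idx] @ lora_B[idx]
    return y

# Example: three requests, two different LoRA adapters, hidden=16, rank=4.
hidden, rank = 16, 4
x = torch.randn(3, hidden)
A = [torch.randn(hidden, rank) for _ in range(2)]
B = [torch.randn(rank, hidden) for _ in range(2)]
delta = sgmv_reference(x, A, B, [(0, 2, 0), (2, 3, 1)])
```

In Punica, this gather-and-multiply is fused into a single kernel launch, so the per-adapter work adds little overhead on top of the batched base-model matmul.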

Quick Start & Requirements

  • Installation: pip install punica -i https://punica-ai.github.io/whl/cu121/ --extra-index-url https://pypi.org/simple (adjust cu121 for your CUDA version). Building from source requires ninja, numpy, and torch.
  • Prerequisites: CUDA (pre-built wheels support 11.8 and 12.1) and Python 3.10 or 3.11. Building from source may require setting TORCH_CUDA_ARCH_LIST; see the environment-check sketch after this list.
  • Resources: Requires a CUDA-enabled GPU.
  • Links: Demo, Paper
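
Before installing a pre-built wheel or compiling from source, a quick environment check can confirm that the local CUDA version and GPU architecture match the wheel tag and the TORCH_CUDA_ARCH_LIST value you intend to set. This snippet is purely illustrative and not part of the Punica package.

```python
import torch

print("CUDA available:", torch.cuda.is_available())
# The reported CUDA version should match the wheel tag, e.g. "12.1" for cu121.
print("Torch CUDA version:", torch.version.cuda)

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    # When building from source, TORCH_CUDA_ARCH_LIST can be pinned to this
    # value, e.g. "8.0" for an A100.
    print(f"GPU compute capability: {major}.{minor}")
```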

Highlighted Details

  • Achieves up to 12x higher text generation throughput compared to state-of-the-art systems like vLLM, FasterTransformer, DeepSpeed, and HuggingFace Transformers.
  • Utilizes a custom SGMV CUDA kernel for efficient LoRA computation.
  • Preserves the strong batching effect of the base LLM.
  • Supports serving multiple LoRA models concurrently with minimal overhead.

Maintenance & Community

  • The project is associated with researchers from the University of Washington.
  • A paper is available for in-depth understanding.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README. This requires further investigation for commercial or closed-source use.

Limitations & Caveats

  • Pre-built wheels are tied to specific CUDA and Python versions, potentially requiring source compilation for compatibility.
  • The lack of a clearly stated license is a significant caveat for adoption.
Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1
Star History
24 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 2 more.

S-LoRA by S-LoRA

0.1%
2k
System for scalable LoRA adapter serving
created 1 year ago
updated 1 year ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Zhuohan Li (Author of vLLM), and 6 more.

torchtitan by pytorch

0.9%
4k
PyTorch platform for generative AI model training research
created 1 year ago
updated 22 hours ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), and 6 more.

LoRA by microsoft

0.3%
12k
PyTorch library for low-rank adaptation (LoRA) of LLMs
created 4 years ago
updated 7 months ago