S-LoRA by S-LoRA

System for scalable LoRA adapter serving

created 1 year ago
1,845 stars

Top 24.0% on sourcepulse

View on GitHub
Project Summary

S-LoRA is a system designed to efficiently serve thousands of concurrent LoRA adapters for large language models. It targets researchers and developers deploying customized LLMs, enabling scalable serving of many task-specific fine-tuned models with significantly improved throughput.

How It Works

S-LoRA stores all LoRA adapters in main memory and loads the adapters needed by the currently running queries into GPU memory on demand. It employs "Unified Paging" to manage adapter weights and KV cache tensors in a single unified memory pool, reducing fragmentation and increasing batch size. Custom CUDA kernels and a novel tensor parallelism strategy enable heterogeneous batching of LoRA computations, minimizing latency and communication overhead.
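
A minimal sketch of the Unified Paging idea, with hypothetical names (the real memory manager lives in S-LoRA's CUDA/Python code): KV-cache blocks and adapter weight slices are allocated from the same fixed-size pages, so either kind of tensor can fill holes left by the other.

    # Hypothetical sketch only: a single page pool shared by KV-cache
    # entries and LoRA adapter weights, reducing fragmentation versus
    # keeping two separate pools. Names and sizes are illustrative.
    from dataclasses import dataclass, field

    PAGE_SIZE = 16  # tokens (KV cache) or weight rows (adapter) per page

    @dataclass
    class UnifiedPagePool:
        num_pages: int
        free_pages: list = field(default_factory=list)

        def __post_init__(self):
            self.free_pages = list(range(self.num_pages))

        def alloc(self, n_units: int) -> list:
            """Allocate enough pages for n_units tokens or weight rows."""
            n_pages = -(-n_units // PAGE_SIZE)  # ceiling division
            if n_pages > len(self.free_pages):
                raise MemoryError("pool exhausted; evict an idle adapter first")
            return [self.free_pages.pop() for _ in range(n_pages)]

        def free(self, pages: list) -> None:
            self.free_pages.extend(pages)

    pool = UnifiedPagePool(num_pages=1024)
    kv_pages = pool.alloc(n_units=37)       # KV cache for a 37-token sequence
    adapter_pages = pool.alloc(n_units=64)  # rank-64 adapter fetched on demand
    pool.free(adapter_pages)                # returned when the adapter goes idle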

Quick Start & Requirements

  • Installation:
    conda create -n slora python=3.9
    conda activate slora
    # Optional: conda install cuda -c nvidia/label/cuda-11.8.0
    export TORCH_CUDA_ARCH_LIST="8.0 8.6"
    pip install torch==2.0.1 triton==2.1.0
    pip install -e .
    
  • Prerequisites: a CUDA 11.8-compatible GPU (Ampere family recommended; Turing-family GPUs such as the T4 are not supported) and PyTorch >= 1.13, <= 2.0.1. A quick environment check is sketched after this list.
  • Resources: Requires significant GPU memory to hold adapters and KV caches.
  • Links: Paper, VTC Paper
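
A sanity check against the stated prerequisites, using standard PyTorch calls (the version and hardware bounds come from the bullets above):

    # Verify the environment: PyTorch between 1.13 and 2.0.1, an
    # Ampere-class GPU (compute capability >= 8.0), and bfloat16
    # support (Turing GPUs such as the T4 lack it).
    import torch

    assert torch.cuda.is_available(), "CUDA GPU required"
    major, minor = torch.cuda.get_device_capability()
    assert (major, minor) >= (8, 0), f"Ampere (sm_80+) recommended, got sm_{major}{minor}"
    assert torch.cuda.is_bf16_supported(), "bfloat16 support required (T4 unsupported)"
    print(f"torch {torch.__version__} on {torch.cuda.get_device_name()} (sm_{major}{minor})")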

Highlighted Details

  • Achieves up to 4x higher throughput and serves orders of magnitude more adapters than HuggingFace PEFT and a naive vLLM-based LoRA serving baseline.
  • Unified Paging manages dynamic adapter weights and KV cache tensors in a single memory pool.
  • Heterogeneous batching with custom CUDA kernels optimizes for varying adapter ranks and sequence lengths (a naive reference implementation is sketched after this list).
  • Novel tensor parallelism strategy minimizes communication overhead for LoRA computations.
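
The custom kernels batch LoRA computations across requests that use different adapters with different ranks. A naive PyTorch reference of what those kernels fuse, with hypothetical shapes and names (the real kernels avoid this Python loop entirely):

    # Naive reference for heterogeneous LoRA batching: each token in the
    # batch may belong to a different adapter with a different rank. The
    # fused CUDA kernels apply y += x @ A_i @ B_i per request in one
    # launch; this sketch groups tokens by adapter and loops instead.
    import torch

    hidden = 256                                 # small for illustration
    base_w = torch.randn(hidden, hidden)
    x = torch.randn(8, hidden)                   # 8 tokens in one batch
    adapter_of_token = torch.tensor([0, 0, 1, 2, 2, 2, 1, 0])
    adapters = {                                 # adapter id -> (A, B); ranks differ
        0: (torch.randn(hidden, 8),  torch.randn(8,  hidden)),   # rank 8
        1: (torch.randn(hidden, 16), torch.randn(16, hidden)),   # rank 16
        2: (torch.randn(hidden, 64), torch.randn(64, hidden)),   # rank 64
    }

    y = x @ base_w                               # shared base-model GEMM
    for aid, (A, B) in adapters.items():
        idx = (adapter_of_token == aid).nonzero(as_tuple=True)[0]
        y[idx] += (x[idx] @ A) @ B               # per-adapter low-rank update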

Maintenance & Community

  • Built on top of LightLLM.
  • Roadmap includes tensor parallelism implementation, script cleanup, user-friendly API, and broader model support.
  • Discord/Slack link provided.

Licensing & Compatibility

  • License not explicitly stated in the README. Compatibility for commercial use or closed-source linking is undetermined.

Limitations & Caveats

  • Does not support older GPUs like the NVIDIA T4 that lack bfloat16 operations.
  • Tensor parallelism implementation is still on the roadmap, suggesting potential for further optimization.
Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 28 stars in the last 90 days

Explore Similar Projects

punica by punica-ai

  • LoRA serving system (research paper) for multi-tenant LLM inference
  • 1k stars · created 1 year ago · updated 1 year ago
  • Starred by Jeff Hammerbacher (cofounder of Cloudera), Chip Huyen (author of AI Engineering, Designing Machine Learning Systems), and 3 more.

FasterTransformer by NVIDIA

  • Optimized transformer library for inference
  • 6k stars · created 4 years ago · updated 1 year ago
  • Starred by Nat Friedman (former CEO of GitHub), Chip Huyen (author of AI Engineering, Designing Machine Learning Systems), and 6 more.