S-LoRA by S-LoRA

System for scalable LoRA adapter serving

Created 1 year ago
1,853 stars

Top 23.4% on SourcePulse

View on GitHub
Project Summary

S-LoRA is a system designed to efficiently serve thousands of concurrent LoRA adapters for large language models. It targets researchers and developers deploying customized LLMs, enabling scalable serving of many task-specific fine-tuned models with significantly improved throughput.

How It Works

S-LoRA stores all LoRA adapters in main memory and dynamically loads them to GPU memory as needed. It employs "Unified Paging" to manage adapter weights and KV cache tensors in a single unified memory pool, reducing fragmentation and increasing batch size. Custom CUDA kernels enable heterogeneous batching of LoRA computations, and a novel tensor parallelism strategy minimizes latency and communication overhead.
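
To make this concrete, here is a minimal, illustrative Python sketch of the unified paging idea. All names (UnifiedPagePool, PAGE_SIZE, allocate/free) are invented for illustration; S-LoRA's actual pool is managed on the GPU by custom kernels, not Python objects.

    # Illustrative sketch only: one fixed-size page pool backs both KV-cache
    # entries and LoRA adapter weights, so either kind of tensor can reuse
    # pages freed by the other, which is what reduces fragmentation.
    from dataclasses import dataclass, field

    PAGE_SIZE = 16  # vectors per page; a real system ties this to the hidden dim

    @dataclass
    class UnifiedPagePool:
        num_pages: int
        free_pages: list = field(default_factory=list)
        owner: dict = field(default_factory=dict)  # page id -> ("kv" | "adapter", key)

        def __post_init__(self):
            self.free_pages = list(range(self.num_pages))

        def allocate(self, kind: str, key: str, num_vectors: int) -> list:
            # Reserve ceil(num_vectors / PAGE_SIZE) pages, regardless of whether
            # they will hold KV-cache entries or adapter weights.
            needed = -(-num_vectors // PAGE_SIZE)
            if needed > len(self.free_pages):
                raise MemoryError("pool exhausted: evict adapters or preempt requests")
            pages = [self.free_pages.pop() for _ in range(needed)]
            for p in pages:
                self.owner[p] = (kind, key)
            return pages

        def free(self, pages: list) -> None:
            for p in pages:
                del self.owner[p]
                self.free_pages.append(p)

    pool = UnifiedPagePool(num_pages=1024)
    kv_pages = pool.allocate("kv", "request-42", num_vectors=200)       # growing KV cache
    lora_pages = pool.allocate("adapter", "adapter-7", num_vectors=64)  # rank-r adapter
    pool.free(kv_pages)  # a finished request returns its pages to the shared pool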

Quick Start & Requirements

  • Installation:
    conda create -n slora python=3.9
    conda activate slora
    # Optional: conda install cuda -c nvidia/label/cuda-11.8.0
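    # Builds the custom kernels for Ampere GPUs (compute capability 8.0 and 8.6)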
    export TORCH_CUDA_ARCH_LIST="8.0 8.6"
    pip install torch==2.0.1 triton==2.1.0
    pip install -e .
    
  • Prerequisites: a CUDA 11.8-compatible GPU (Ampere family recommended; Turing-family GPUs such as the T4 are not supported) and PyTorch between 1.13 and 2.0.1.
  • Resources: requires substantial host RAM to hold the full adapter pool, plus enough GPU memory for the active adapters and KV caches.
  • Links: Paper, VTC Paper

Highlighted Details

  • Achieves up to 4x higher throughput and serves orders of magnitude more adapters compared to HuggingFace PEFT and naive vLLM LoRA serving.
  • Unified Paging manages dynamic adapter weights and KV cache tensors in a single memory pool.
  • Heterogeneous batching with custom CUDA kernels optimizes for varying adapter ranks and sequence lengths (a simplified sketch follows this list).
  • Novel tensor parallelism strategy minimizes communication overhead for LoRA computations.
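
As a rough illustration of the heterogeneous batching above, the PyTorch sketch below mixes requests whose adapters have different ranks in one batch. The per-segment Python loop stands in for S-LoRA's fused CUDA kernels, and all tensor names are invented for the example.

    # Conceptual sketch: one batched GEMM for the shared base weight, then
    # per-request LoRA deltas with differing ranks (8 vs. 32).
    import torch

    hidden = 64
    W = torch.randn(hidden, hidden)  # shared base weight

    # Hypothetical adapters: A is (hidden x r), B is (r x hidden)
    adapters = {
        "a0": (torch.randn(hidden, 8), torch.randn(8, hidden)),
        "a1": (torch.randn(hidden, 32), torch.randn(32, hidden)),
    }
    batch = [("a0", torch.randn(3, hidden)),   # request 0: 3 tokens, rank-8 adapter
             ("a1", torch.randn(5, hidden))]   # request 1: 5 tokens, rank-32 adapter

    xs = torch.cat([x for _, x in batch])      # (8, hidden)
    base = xs @ W                              # base model: one batched GEMM

    # LoRA deltas per segment; a real kernel performs this gather in one launch
    deltas, offset = [], 0
    for adapter_id, x in batch:
        A, B = adapters[adapter_id]
        deltas.append(xs[offset:offset + x.shape[0]] @ A @ B)
        offset += x.shape[0]
    out = base + torch.cat(deltas)             # (8, hidden)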

Maintenance & Community

  • Built on top of LightLLM.
  • Roadmap includes the tensor parallelism implementation, script cleanup, a more user-friendly API, and broader model support.
  • Discord/Slack link provided.

Licensing & Compatibility

  • License not explicitly stated in the README. Compatibility for commercial use or closed-source linking is undetermined.

Limitations & Caveats

  • Does not support older GPUs, such as the NVIDIA T4, that lack bfloat16 operations (see the capability check after this list).
  • The tensor parallelism strategy is described in the paper, but per the roadmap its implementation is still pending, so it may not be fully available in the released code.
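
If you are unsure whether a given GPU falls under the bfloat16 limitation, a quick check using PyTorch (already required by the project) is sketched below; Turing-family GPUs such as the T4 will report False.

    import torch

    # Prints the device name and whether it supports the bfloat16 operations
    # that S-LoRA's kernels depend on.
    if torch.cuda.is_available():
        print(torch.cuda.get_device_name(0))
        print("bfloat16 supported:", torch.cuda.is_bf16_supported())
    else:
        print("No CUDA device visible")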

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 8 stars in the last 30 days

Explore Similar Projects

Starred by Jeff Hammerbacher (Cofounder of Cloudera), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 5 more.

punica by punica-ai

0.2%
1k
LoRA serving system (research paper) for multi-tenant LLM inference
Created 2 years ago
Updated 1 year ago
Starred by Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 8 more.

lorax by predibase

0.2%
3k
Multi-LoRA inference server for serving 1000s of fine-tuned LLMs
Created 1 year ago
Updated 4 months ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Ying Sheng (Coauthor of SGLang).

fastllm by ztxz16

0.4%
4k
High-performance C++ LLM inference library
Created 2 years ago
Updated 1 week ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Clement Delangue (Cofounder of Hugging Face), and 58 more.

vllm by vllm-project

1.1%
58k
LLM serving engine for high-throughput, memory-efficient inference
Created 2 years ago
Updated 12 hours ago