S-LoRA by S-LoRA

System for scalable LoRA adapter serving

Created 1 year ago
1,853 stars

Top 23.4% on SourcePulse

View on GitHub
Project Summary

S-LoRA is a system designed to efficiently serve thousands of concurrent LoRA adapters for large language models. It targets researchers and developers deploying customized LLMs, enabling scalable serving of many task-specific fine-tuned models with significantly improved throughput.

How It Works

S-LoRA stores all LoRA adapters in main memory and dynamically loads them to GPU memory as needed. It employs "Unified Paging" to manage adapter weights and KV cache tensors in a single unified memory pool, reducing fragmentation and increasing batch size. Custom CUDA kernels enable heterogeneous batching of LoRA computations, and a novel tensor parallelism strategy minimizes latency and communication overhead.
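
To make this concrete, here is a minimal, illustrative Python sketch of the unified paging idea. All names (UnifiedPagePool, PAGE_SIZE, allocate/free) are invented for illustration; S-LoRA's actual pool is managed on the GPU by custom kernels, not Python objects.

    # Illustrative sketch only: one fixed-size page pool backs both KV-cache
    # entries and LoRA adapter weights, so either kind of tensor can reuse
    # pages freed by the other, which is what reduces fragmentation.
    from dataclasses import dataclass, field

    PAGE_SIZE = 16  # vectors per page; a real system ties this to the hidden dim

    @dataclass
    class UnifiedPagePool:
        num_pages: int
        free_pages: list = field(default_factory=list)
        owner: dict = field(default_factory=dict)  # page id -> ("kv" | "adapter", key)

        def __post_init__(self):
            self.free_pages = list(range(self.num_pages))

        def allocate(self, kind: str, key: str, num_vectors: int) -> list:
            # Reserve ceil(num_vectors / PAGE_SIZE) pages, regardless of whether
            # they will hold KV-cache entries or adapter weights.
            needed = -(-num_vectors // PAGE_SIZE)
            if needed > len(self.free_pages):
                raise MemoryError("pool exhausted: evict adapters or preempt requests")
            pages = [self.free_pages.pop() for _ in range(needed)]
            for p in pages:
                self.owner[p] = (kind, key)
            return pages

        def free(self, pages: list) -> None:
            for p in pages:
                del self.owner[p]
                self.free_pages.append(p)

    pool = UnifiedPagePool(num_pages=1024)
    kv_pages = pool.allocate("kv", "request-42", num_vectors=200)       # growing KV cache
    lora_pages = pool.allocate("adapter", "adapter-7", num_vectors=64)  # rank-r adapter
    pool.free(kv_pages)  # a finished request returns its pages to the shared pool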

Quick Start & Requirements

  • Installation:
    conda create -n slora python=3.9
    conda activate slora
    # Optional: conda install cuda -c nvidia/label/cuda-11.8.0
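    # Builds the custom kernels for Ampere GPUs (compute capability 8.0 and 8.6)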
    export TORCH_CUDA_ARCH_LIST="8.0 8.6"
    pip install torch==2.0.1 triton==2.1.0
    pip install -e .
    
  • Prerequisites: a CUDA 11.8-compatible GPU (Ampere family recommended; Turing-family GPUs such as the T4 are not supported) and PyTorch between 1.13 and 2.0.1.
  • Resources: requires substantial host RAM to hold the full adapter pool, plus enough GPU memory for the active adapters and KV caches.
  • Links: Paper, VTC Paper

Highlighted Details

  • Achieves up to 4x higher throughput and serves orders of magnitude more adapters compared to HuggingFace PEFT and naive vLLM LoRA serving.
  • Unified Paging manages dynamic adapter weights and KV cache tensors in a single memory pool.
  • Heterogeneous batching with custom CUDA kernels optimizes for varying adapter ranks and sequence lengths (a simplified sketch follows this list).
  • Novel tensor parallelism strategy minimizes communication overhead for LoRA computations.
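
As a rough illustration of the heterogeneous batching above, the PyTorch sketch below mixes requests whose adapters have different ranks in one batch. The per-segment Python loop stands in for S-LoRA's fused CUDA kernels, and all tensor names are invented for the example.

    # Conceptual sketch: one batched GEMM for the shared base weight, then
    # per-request LoRA deltas with differing ranks (8 vs. 32).
    import torch

    hidden = 64
    W = torch.randn(hidden, hidden)  # shared base weight

    # Hypothetical adapters: A is (hidden x r), B is (r x hidden)
    adapters = {
        "a0": (torch.randn(hidden, 8), torch.randn(8, hidden)),
        "a1": (torch.randn(hidden, 32), torch.randn(32, hidden)),
    }
    batch = [("a0", torch.randn(3, hidden)),   # request 0: 3 tokens, rank-8 adapter
             ("a1", torch.randn(5, hidden))]   # request 1: 5 tokens, rank-32 adapter

    xs = torch.cat([x for _, x in batch])      # (8, hidden)
    base = xs @ W                              # base model: one batched GEMM

    # LoRA deltas per segment; a real kernel performs this gather in one launch
    deltas, offset = [], 0
    for adapter_id, x in batch:
        A, B = adapters[adapter_id]
        deltas.append(xs[offset:offset + x.shape[0]] @ A @ B)
        offset += x.shape[0]
    out = base + torch.cat(deltas)             # (8, hidden)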

Maintenance & Community

  • Built on top of LightLLM.
  • Roadmap includes the tensor parallelism implementation, script cleanup, a more user-friendly API, and broader model support.
  • Discord/Slack link provided.

Licensing & Compatibility

  • License not explicitly stated in the README. Compatibility for commercial use or closed-source linking is undetermined.

Limitations & Caveats

  • Does not support older GPUs, such as the NVIDIA T4, that lack bfloat16 operations (see the capability check after this list).
  • The tensor parallelism strategy is described in the paper, but per the roadmap its implementation is still pending, so it may not be fully available in the released code.
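
If you are unsure whether a given GPU falls under the bfloat16 limitation, a quick check using PyTorch (already required by the project) is sketched below; Turing-family GPUs such as the T4 will report False.

    import torch

    # Prints the device name and whether it supports the bfloat16 operations
    # that S-LoRA's kernels depend on.
    if torch.cuda.is_available():
        print(torch.cuda.get_device_name(0))
        print("bfloat16 supported:", torch.cuda.is_bf16_supported())
    else:
        print("No CUDA device visible")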

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 8 stars in the last 30 days

Explore Similar Projects

Starred by Jeff Hammerbacher (Cofounder of Cloudera), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 5 more.

punica by punica-ai

0.2%
1k
LoRA serving system (research paper) for multi-tenant LLM inference
Created 2 years ago
Updated 1 year ago
Starred by Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 8 more.

lorax by predibase

0.2%
3k
Multi-LoRA inference server for serving 1000s of fine-tuned LLMs
Created 1 year ago
Updated 4 months ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Ying Sheng (Coauthor of SGLang).

fastllm by ztxz16

0.4%
4k
High-performance C++ LLM inference library
Created 2 years ago
Updated 1 week ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Clement Delangue (Cofounder of Hugging Face), and 58 more.

vllm by vllm-project

1.1%
58k
LLM serving engine for high-throughput, memory-efficient inference
Created 2 years ago
Updated 12 hours ago