System for scalable LoRA adapter serving
S-LoRA is a system designed to efficiently serve thousands of concurrent LoRA adapters on top of a single base large language model. It targets researchers and developers deploying customized LLMs, enabling scalable serving of many task-specific fine-tuned models; the authors report throughput up to 4x that of vLLM with naive LoRA support and up to 30x that of HuggingFace PEFT.
How It Works
S-LoRA stores all LoRA adapters in main memory and dynamically fetches the adapters needed by the currently running batch into GPU memory. It employs "Unified Paging": adapter weights and KV cache tensors are allocated from a single paged memory pool, which reduces fragmentation and allows larger batch sizes. Custom CUDA kernels operate directly on this non-contiguous paged memory to batch heterogeneous LoRA computations (different adapters, different ranks), and a novel tensor parallelism strategy keeps the communication overhead of the added LoRA weights small.
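Below is a minimal sketch of the Unified Paging idea, assuming a hypothetical UnifiedPagedPool class that is not part of S-LoRA's API: the point is only that KV-cache pages and adapter-weight pages come from one allocator, so pages freed by an evicted adapter are immediately usable as KV-cache capacity, and vice versa. S-LoRA's actual pool is a GPU tensor managed by its custom kernels.

# Hypothetical illustration of a unified memory pool for KV-cache and
# LoRA-adapter pages; not S-LoRA's actual implementation.
import torch

class UnifiedPagedPool:
    def __init__(self, num_pages: int, page_size: int, dim: int):
        # One contiguous buffer; a page holds either KV entries or adapter rows.
        self.pool = torch.empty(num_pages, page_size, dim)
        self.free = list(range(num_pages))
        self.owner = {}  # page index -> ("kv", seq_id) or ("lora", adapter_name)

    def alloc(self, tag, n_pages: int):
        if len(self.free) < n_pages:
            raise MemoryError("pool exhausted; evict an idle adapter or preempt a request")
        pages = [self.free.pop() for _ in range(n_pages)]
        for p in pages:
            self.owner[p] = tag
        return pages

    def release(self, pages):
        for p in pages:
            del self.owner[p]
            self.free.append(p)

# KV cache for a running sequence and an adapter's weights share the pool,
# so freeing one kind of page immediately makes room for the other kind.
pool = UnifiedPagedPool(num_pages=1024, page_size=16, dim=4096)
kv_pages = pool.alloc(("kv", "seq-0"), n_pages=8)
lora_pages = pool.alloc(("lora", "adapter-A"), n_pages=4)
pool.release(lora_pages)  # evicted adapter pages become KV-cache capacity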
Quick Start & Requirements
conda create -n slora python=3.9
conda activate slora
# Optional: conda install cuda -c nvidia/label/cuda-11.8.0
# Build kernels for your GPU's compute capability (8.0 = A100, 8.6 = RTX 30xx)
export TORCH_CUDA_ARCH_LIST="8.0 8.6"
pip install torch==2.0.1 triton==2.1.0
pip install -e .
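To make the heterogeneous batching described under How It Works concrete, here is a hedged sketch of the gather-then-compute pattern, with illustrative names (base_w, adapters, batched_lora_forward) that are not part of S-LoRA's API: the base-model matmul is shared across the whole batch, while each request adds its own low-rank correction x_i @ A_i @ B_i. S-LoRA fuses this per-request step into custom CUDA kernels that read adapter weights directly from the paged pool instead of looping in Python.

import torch

d_in, d_out, rank = 4096, 4096, 16
base_w = torch.randn(d_in, d_out) * 0.02  # shared base weight

# Per-adapter low-rank factors; in S-LoRA these live in the paged pool.
adapters = {
    "adapter-A": (torch.randn(d_in, rank), torch.randn(rank, d_out)),
    "adapter-B": (torch.randn(d_in, rank), torch.randn(rank, d_out)),
}

def batched_lora_forward(x: torch.Tensor, adapter_ids: list[str]) -> torch.Tensor:
    # One big matmul against the shared base weight for the whole batch...
    y = x @ base_w
    # ...plus a small per-request low-rank correction gathered by adapter id.
    # The real system batches these with custom kernels instead of a loop.
    for i, name in enumerate(adapter_ids):
        a, b = adapters[name]
        y[i] += x[i] @ a @ b
    return y

x = torch.randn(2, d_in)
y = batched_lora_forward(x, ["adapter-A", "adapter-B"])
print(y.shape)  # torch.Size([2, 4096])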
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats