Multi-LoRA inference server for serving 1000s of fine-tuned LLMs
LoRAX is an inference server designed to serve thousands of fine-tuned LoRA adapters on a single GPU, significantly reducing the cost of serving many LLM variants. It targets developers and researchers who need to deploy multiple specialized LLM variants efficiently, while maintaining high throughput and low latency.
How It Works
LoRAX employs dynamic adapter loading: adapters from HuggingFace or the local filesystem are loaded on demand, per request, without blocking other requests. Heterogeneous continuous batching groups requests for different adapters into the same batch, maintaining consistent throughput and latency. An adapter exchange scheduler asynchronously prefetches adapters and offloads them between GPU and CPU memory to optimize aggregate throughput.
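In practice, each request simply names the adapter it should use. The sketch below assumes a server already running as in the quick start that follows, and that LoRAX exposes the text-generation-inference-style /generate endpoint with an adapter_id request parameter; the adapter IDs shown are illustrative, not real repositories. Two requests targeting different adapters can be sent to the same deployment and batched together:

# two concurrent requests, each routed to a different (hypothetical) LoRA adapter
curl 127.0.0.1:8080/generate -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "[INST] Summarize this support ticket [/INST]", "parameters": {"max_new_tokens": 64, "adapter_id": "acme/mistral-7b-support-lora"}}' &
curl 127.0.0.1:8080/generate -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "[INST] Translate to French: good morning [/INST]", "parameters": {"max_new_tokens": 64, "adapter_id": "acme/mistral-7b-translation-lora"}}' &
wait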
Quick Start & Requirements
model=mistralai/Mistral-7B-Instruct-v0.1
volume=$PWD/data
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
ghcr.io/predibase/lorax:main --model-id $model
Requires an NVIDIA GPU with Docker and the nvidia-container-toolkit installed.
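Once the container is running, the deployment can be smoke-tested with a plain generation request; in this sketch no adapter_id is given, so the base model answers (again assuming the /generate endpoint and the port mapping above):

# prompt the base model (no adapter_id)
curl 127.0.0.1:8080/generate -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "[INST] What is the capital of France? [/INST]", "parameters": {"max_new_tokens": 32}}'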
Highlighted Details
Maintenance & Community
Built on HuggingFace's text-generation-inference (forked from v0.9.4).
Licensing & Compatibility
Limitations & Caveats
LoRAX is forked from text-generation-inference v0.9.4, implying potential divergence from upstream or reliance on that version's specific features and bug fixes.