lorax by predibase

Multi-LoRA inference server for serving 1000s of fine-tuned LLMs

Created 1 year ago
3,420 stars

Top 14.1% on SourcePulse

View on GitHub
Project Summary

LoRAX is an inference server designed to serve thousands of fine-tuned LoRA adapters on a single GPU, significantly reducing serving costs for LLMs. It targets developers and researchers needing to deploy multiple specialized LLM variants efficiently, offering high throughput and low latency.

How It Works

LoRAX employs dynamic adapter loading, so adapters from HuggingFace or local filesystems can be loaded on demand per request without blocking other requests. Heterogeneous continuous batching groups requests for different adapters into the same batch, keeping latency and throughput consistent as the number of adapters grows. An adapter exchange scheduler asynchronously prefetches adapters and offloads them between GPU and CPU memory to optimize aggregate throughput.
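
Per-request adapter selection is exposed directly in the client API. A minimal sketch using the lorax-client Python package; the prompt and adapter ID below are illustrative examples, not part of this summary:

    # pip install lorax-client
    from lorax import Client

    client = Client("http://127.0.0.1:8080")
    prompt = "[INST] What is the capital of France? [/INST]"

    # Hits the base model: no adapter is applied.
    print(client.generate(prompt, max_new_tokens=64).generated_text)

    # Hits a specific LoRA adapter: LoRAX downloads and loads it on demand,
    # and can batch this request together with requests for other adapters.
    adapter_id = "vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k"
    print(client.generate(prompt, max_new_tokens=64, adapter_id=adapter_id).generated_text)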

Quick Start & Requirements

  • Install/Run: Recommended via the pre-built Docker image (a sample request follows this list).
    model=mistralai/Mistral-7B-Instruct-v0.1
    volume=$PWD/data
    docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
        ghcr.io/predibase/lorax:main --model-id $model
    
  • Prerequisites: NVIDIA GPU (Ampere+), CUDA 11.8+, Linux, Docker, nvidia-container-toolkit.
  • Resources: Enough GPU memory to hold the base model plus any dynamically loaded adapters.
  • Docs: Getting Started - Docker, REST API, Python Client, OpenAI Compatible API.
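
Once the container is up, the REST API can be exercised with any HTTP client. A minimal sketch using Python's requests against the /generate endpoint (the prompt and adapter ID are illustrative):

    import requests

    resp = requests.post(
        "http://127.0.0.1:8080/generate",
        json={
            "inputs": "[INST] Summarize LoRAX in one sentence. [/INST]",
            "parameters": {
                "max_new_tokens": 64,
                # Omit adapter_id to query the base model; set it to load a LoRA adapter on demand.
                "adapter_id": "vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k",
            },
        },
        timeout=60,
    )
    resp.raise_for_status()
    print(resp.json()["generated_text"])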

Highlighted Details

  • Supports dynamic loading and merging of LoRA adapters from HuggingFace or local filesystems.
  • Achieves high throughput and low latency via optimizations like tensor parallelism, flash-attention, paged attention, SGMV, quantization, and token streaming.
  • Offers production-ready features including Docker images, Kubernetes Helm charts, Prometheus metrics, and OpenTelemetry integration.
  • Provides an OpenAI-compatible API for multi-turn chat and dynamic adapter selection.
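
A minimal sketch of the OpenAI-compatible chat endpoint, assuming the quick-start server is listening on port 8080; the adapter ID passed as model is an illustrative example:

    # pip install openai
    from openai import OpenAI

    # Point the standard OpenAI client at the local LoRAX server.
    client = OpenAI(api_key="EMPTY", base_url="http://127.0.0.1:8080/v1")

    resp = client.chat.completions.create(
        # The model field selects the LoRA adapter (or the base model ID).
        model="alignment-handbook/zephyr-7b-dpo-lora",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain multi-LoRA serving in two sentences."},
        ],
        max_tokens=100,
    )
    print(resp.choices[0].message.content)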

Maintenance & Community

Licensing & Compatibility

  • License: Apache 2.0.
  • Free for commercial use.

Limitations & Caveats

  • Requires NVIDIA GPUs (Ampere generation or newer) and specific CUDA versions, limiting hardware compatibility.
  • The project is forked from text-generation-inference v0.9.4, so it may diverge from upstream and relies on that version's feature set and bug fixes.

Health Check

  • Last Commit: 4 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 1

Star History

40 stars in the last 30 days

Explore Similar Projects

Starred by Jeff Hammerbacher (Cofounder of Cloudera), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 5 more.

punica by punica-ai

0.2%
1k
LoRA serving system (research paper) for multi-tenant LLM inference
Created 2 years ago
Updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Johannes Hagemann (Cofounder of Prime Intellect), and 4 more.

S-LoRA by S-LoRA

0.2%
2k
System for scalable LoRA adapter serving
Created 1 year ago
Updated 1 year ago
Starred by Luis Capelo (Cofounder of Lightning AI), Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), and 4 more.

ktransformers by kvcache-ai

0.3%
15k
Framework for LLM inference optimization experimentation
Created 1 year ago
Updated 2 days ago