lorax by predibase

Multi-LoRA inference server for serving 1000s of fine-tuned LLMs

created 1 year ago · 3,333 stars · Top 14.9% on sourcepulse

Project Summary

LoRAX is an inference server designed to serve thousands of fine-tuned LoRA adapters on a single GPU, significantly reducing the cost of serving many fine-tuned LLMs. It targets developers and researchers who need to deploy many specialized LLM variants efficiently while maintaining high throughput and low latency.

How It Works

LoRAX employs dynamic adapter loading: adapters from HuggingFace or the local filesystem are loaded on demand, per request, without blocking concurrent requests. Heterogeneous continuous batching groups requests for different adapters into the same batch, keeping latency consistent as the number of adapters grows. An adapter exchange scheduler prefetches adapters and offloads them between GPU and CPU memory to optimize aggregate throughput.
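
Because the adapter is chosen per request, fanning out across many fine-tunes is an ordinary REST call. A minimal sketch in Python, assuming the Docker server from the Quick Start below is listening on port 8080; my-org/mistral-7b-math-lora is a hypothetical adapter repo used for illustration, and adapter_id is the generation parameter that selects the adapter for that request.

    # Per-request adapter routing against a local LoRAX server; the
    # adapter_id below is a hypothetical HuggingFace repo for illustration.
    import requests

    LORAX_URL = "http://127.0.0.1:8080"

    def generate(prompt, adapter_id=None):
        # Each request may name a different LoRA adapter; LoRAX loads it
        # on demand and batches it alongside requests for other adapters.
        payload = {"inputs": prompt, "parameters": {"max_new_tokens": 64}}
        if adapter_id is not None:
            payload["parameters"]["adapter_id"] = adapter_id
        resp = requests.post(f"{LORAX_URL}/generate", json=payload, timeout=60)
        resp.raise_for_status()
        return resp.json()["generated_text"]

    # Base model and a fine-tuned variant, multiplexed over the same GPU.
    print(generate("What is 6 times 7?"))
    print(generate("What is 6 times 7?", adapter_id="my-org/mistral-7b-math-lora"))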

Quick Start & Requirements

  • Install/Run: Recommended via pre-built Docker image.
    model=mistralai/Mistral-7B-Instruct-v0.1
    volume=$PWD/data
    docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
        ghcr.io/predibase/lorax:main --model-id $model
    
  • Prerequisites: NVIDIA GPU (Ampere+), CUDA 11.8+, Linux, Docker, nvidia-container-toolkit.
  • Resources: Needs GPU memory for the base model, plus headroom for dynamically loaded adapters.
  • Docs: Getting Started - Docker, REST API, Python Client (see the sketch after this list), OpenAI Compatible API.
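
With the container running, the official Python client (pip install lorax-client) wraps the same REST API. A minimal sketch; the adapter id is a placeholder, not a published repo.

    # Querying the Docker container above via the lorax-client package.
    from lorax import Client

    client = Client("http://127.0.0.1:8080")

    # Plain base-model generation.
    print(client.generate("[INST] Hello, who are you? [/INST]",
                          max_new_tokens=64).generated_text)

    # The same call routed through a fine-tuned adapter (placeholder id);
    # the adapter is downloaded and loaded on first use.
    print(client.generate("[INST] Hello, who are you? [/INST]",
                          max_new_tokens=64,
                          adapter_id="my-org/mistral-7b-chat-lora").generated_text)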

Highlighted Details

  • Supports dynamic loading and merging of LoRA adapters from HuggingFace or local filesystems.
  • Achieves high throughput and low latency via optimizations like tensor parallelism, flash-attention, paged attention, SGMV, quantization, and token streaming.
  • Offers production-ready features including Docker images, Kubernetes Helm charts, Prometheus metrics, and OpenTelemetry integration.
  • Provides an OpenAI-compatible API for multi-turn chat and dynamic adapter selection (sketch below).
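
The OpenAI-compatible route makes adapter selection a one-line change in existing clients: the model field names the adapter. A minimal sketch with the openai Python package, assuming the Quick Start server and a hypothetical adapter id:

    # Multi-turn chat through LoRAX's OpenAI-compatible endpoint.
    from openai import OpenAI

    client = OpenAI(
        api_key="EMPTY",                      # LoRAX does not require a real key
        base_url="http://127.0.0.1:8080/v1",  # OpenAI-compatible route
    )

    resp = client.chat.completions.create(
        model="my-org/mistral-7b-chat-lora",  # hypothetical adapter repo id
        messages=[
            {"role": "system", "content": "You are a terse assistant."},
            {"role": "user", "content": "Name one benefit of multi-LoRA serving."},
        ],
        max_tokens=64,
    )
    print(resp.choices[0].message.content)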

Maintenance & Community

  • Last commit: 2 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 1
  • Issues (30d): 1
  • Star history: 392 stars in the last 90 days

Licensing & Compatibility

  • License: Apache 2.0.
  • Free for commercial use.

Limitations & Caveats

  • Requires NVIDIA GPUs (Ampere generation or newer), CUDA 11.8+, and Linux, limiting hardware compatibility.
  • The project is a fork of text-generation-inference v0.9.4, so it depends on that version's behavior and does not automatically track later upstream features and fixes.

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 2 more.

S-LoRA by S-LoRA

Top 0.1% on sourcepulse · 2k stars
System for scalable LoRA adapter serving
created 1 year ago · updated 1 year ago
Starred by Carol Willing (Core Contributor to CPython, Jupyter), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 4 more.

dynamo by ai-dynamo

Top 1.1% on sourcepulse · 5k stars
Inference framework for distributed generative AI model serving
created 5 months ago · updated 21 hours ago