lorax by predibase

Multi-LoRA inference server for serving 1000s of fine-tuned LLMs

Created 2 years ago
3,745 stars

Top 12.9% on SourcePulse

Project Summary

LoRAX is an inference server designed to serve thousands of fine-tuned LoRA adapters on a single GPU, significantly reducing serving costs for LLMs. It targets developers and researchers needing to deploy multiple specialized LLM variants efficiently, offering high throughput and low latency.

How It Works

LoRAX employs dynamic adapter loading: adapters from HuggingFace or local filesystems are loaded on demand per request without blocking concurrent requests. Heterogeneous continuous batching groups requests for different adapters into the same batch, keeping latency and throughput consistent as the number of adapters grows. An adapter exchange scheduler also prefetches adapters and offloads them between GPU and CPU memory to optimize aggregate throughput.
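A rough client-side sketch of what per-request adapter routing looks like: each request names its own adapter, and the server can decode requests for different adapters in one shared batch. The adapter names below are hypothetical placeholders, and the payload shape assumes LoRAX's /generate endpoint with an `adapter_id` parameter.

```python
import json

def build_request(prompt: str, adapter_id: str) -> dict:
    """Build a /generate request body targeting one LoRA adapter."""
    return {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": 64,
            # Resolved on demand from HuggingFace Hub or a local path;
            # these adapter names are made up for illustration.
            "adapter_id": adapter_id,
        },
    }

# Two concurrent requests for different adapters; with heterogeneous
# continuous batching, the server can schedule both into the same batch
# on top of a single copy of the base model.
req_a = build_request("Summarize this ticket: ...", "acme/support-summarizer-lora")
req_b = build_request("Translate to French: hello", "acme/translator-lora")

print(json.dumps(req_a, indent=2))
```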

Quick Start & Requirements

  • Install/Run: recommended via the pre-built Docker image:
    model=mistralai/Mistral-7B-Instruct-v0.1
    volume=$PWD/data
    docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
        ghcr.io/predibase/lorax:main --model-id $model
    
  • Prerequisites: NVIDIA GPU (Ampere+), CUDA 11.8+, Linux, Docker, nvidia-container-toolkit.
  • Resources: Requires GPU memory for the base model and dynamic loading of adapters.
  • Docs: Getting Started - Docker, REST API, Python Client, OpenAI Compatible API.
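Once the container is up, the server can be queried over REST. The sketch below uses only the standard library and assumes the default port mapping from the Docker command above (localhost:8080) and the /generate endpoint; the example adapter ID is hypothetical.

```python
import json
import urllib.request

def build_payload(prompt, adapter_id=None, max_new_tokens=64):
    """Build the JSON body for a /generate request; adapter_id is optional."""
    parameters = {"max_new_tokens": max_new_tokens}
    if adapter_id is not None:
        parameters["adapter_id"] = adapter_id
    return {"inputs": prompt, "parameters": parameters}

def query_lorax(prompt, adapter_id=None, url="http://localhost:8080/generate"):
    """POST a generation request to a running LoRAX server."""
    body = json.dumps(build_payload(prompt, adapter_id)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# With a running server, something like:
#   query_lorax("What is deep learning?", adapter_id="some-org/some-lora")
```

Omitting `adapter_id` queries the base model directly.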

Highlighted Details

  • Supports dynamic loading and merging of LoRA adapters from HuggingFace or local filesystems.
  • Achieves high throughput and low latency via optimizations like tensor parallelism, flash-attention, paged attention, SGMV, quantization, and token streaming.
  • Offers production-ready features including Docker images, Kubernetes Helm charts, Prometheus metrics, and OpenTelemetry integration.
  • Provides an OpenAI-compatible API for multi-turn chat and dynamic adapter selection.
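For the OpenAI-compatible API, the request body follows the usual chat-completions shape, with the `model` field selecting the adapter. A minimal sketch, assuming the /v1/chat/completions route and a hypothetical adapter name:

```python
import json

# Standard chat-completions payload; the "model" field carries the
# LoRA adapter ID (a made-up name here) instead of a base model name.
chat_request = {
    "model": "acme/support-chat-lora",
    "messages": [
        {"role": "system", "content": "You are a support assistant."},
        {"role": "user", "content": "My order never arrived."},
    ],
    "max_tokens": 128,
}

# This body would be POSTed to http://localhost:8080/v1/chat/completions.
print(json.dumps(chat_request, indent=2))
```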

Maintenance & Community

Licensing & Compatibility

  • License: Apache 2.0.
  • Free for commercial use.

Limitations & Caveats

  • Requires NVIDIA GPUs (Ampere generation or newer) and specific CUDA versions, limiting hardware compatibility.
  • The project is forked from text-generation-inference v0.9.4, so it may diverge from upstream and does not automatically inherit features or bug fixes from later releases.
Health Check

  • Last Commit: 10 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 17 stars in the last 30 days

Explore Similar Projects

Starred by Matthew Johnson (coauthor of JAX; Research Scientist at Google Brain), Roy Frostig (coauthor of JAX; Research Scientist at Google DeepMind), and 3 more.

sglang-jax by sgl-project

1.5% · 264 stars
High-performance LLM inference engine for JAX/TPU serving
Created 8 months ago · Updated 1 day ago
Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Johannes Hagemann (cofounder of Prime Intellect), and 4 more.

S-LoRA by S-LoRA

0.2% · 2k stars
System for scalable LoRA adapter serving
Created 2 years ago · Updated 2 years ago