lorax by predibase

Multi-LoRA inference server for serving 1000s of fine-tuned LLMs

created 1 year ago · 3,333 stars · Top 14.9% on sourcepulse

Project Summary

LoRAX is an inference server designed to serve thousands of fine-tuned LoRA adapters on a single GPU, significantly reducing the cost of serving many fine-tuned LLMs. It targets developers and researchers who need to deploy many specialized LLM variants efficiently while maintaining high throughput and low latency.

How It Works

LoRAX employs dynamic adapter loading: adapters from HuggingFace or the local filesystem are loaded on demand, per request, without blocking concurrent requests. Heterogeneous continuous batching groups requests for different adapters into the same batch, keeping latency consistent as the number of adapters grows. An adapter exchange scheduler prefetches adapters and offloads them between GPU and CPU memory to optimize aggregate throughput.
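
Because the adapter is chosen per request, fanning out across many fine-tunes is an ordinary REST call. A minimal sketch in Python, assuming the Docker server from the Quick Start below is listening on port 8080; my-org/mistral-7b-math-lora is a hypothetical adapter repo used for illustration, and adapter_id is the generation parameter that selects the adapter for that request.

    # Per-request adapter routing against a local LoRAX server; the
    # adapter_id below is a hypothetical HuggingFace repo for illustration.
    import requests

    LORAX_URL = "http://127.0.0.1:8080"

    def generate(prompt, adapter_id=None):
        # Each request may name a different LoRA adapter; LoRAX loads it
        # on demand and batches it alongside requests for other adapters.
        payload = {"inputs": prompt, "parameters": {"max_new_tokens": 64}}
        if adapter_id is not None:
            payload["parameters"]["adapter_id"] = adapter_id
        resp = requests.post(f"{LORAX_URL}/generate", json=payload, timeout=60)
        resp.raise_for_status()
        return resp.json()["generated_text"]

    # Base model and a fine-tuned variant, multiplexed over the same GPU.
    print(generate("What is 6 times 7?"))
    print(generate("What is 6 times 7?", adapter_id="my-org/mistral-7b-math-lora"))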

Quick Start & Requirements

  • Install/Run: Recommended via pre-built Docker image.
    model=mistralai/Mistral-7B-Instruct-v0.1
    volume=$PWD/data
    docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
        ghcr.io/predibase/lorax:main --model-id $model
    
  • Prerequisites: NVIDIA GPU (Ampere+), CUDA 11.8+, Linux, Docker, nvidia-container-toolkit.
  • Resources: Needs GPU memory for the base model, plus headroom for dynamically loaded adapters.
  • Docs: Getting Started - Docker, REST API, Python Client (see the sketch after this list), OpenAI Compatible API.
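
With the container running, the official Python client (pip install lorax-client) wraps the same REST API. A minimal sketch; the adapter id is a placeholder, not a published repo.

    # Querying the Docker container above via the lorax-client package.
    from lorax import Client

    client = Client("http://127.0.0.1:8080")

    # Plain base-model generation.
    print(client.generate("[INST] Hello, who are you? [/INST]",
                          max_new_tokens=64).generated_text)

    # The same call routed through a fine-tuned adapter (placeholder id);
    # the adapter is downloaded and loaded on first use.
    print(client.generate("[INST] Hello, who are you? [/INST]",
                          max_new_tokens=64,
                          adapter_id="my-org/mistral-7b-chat-lora").generated_text)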

Highlighted Details

  • Supports dynamic loading and merging of LoRA adapters from HuggingFace or local filesystems.
  • Achieves high throughput and low latency via optimizations like tensor parallelism, flash-attention, paged attention, SGMV, quantization, and token streaming.
  • Offers production-ready features including Docker images, Kubernetes Helm charts, Prometheus metrics, and OpenTelemetry integration.
  • Provides an OpenAI-compatible API for multi-turn chat and dynamic adapter selection (sketch below).
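
The OpenAI-compatible route makes adapter selection a one-line change in existing clients: the model field names the adapter. A minimal sketch with the openai Python package, assuming the Quick Start server and a hypothetical adapter id:

    # Multi-turn chat through LoRAX's OpenAI-compatible endpoint.
    from openai import OpenAI

    client = OpenAI(
        api_key="EMPTY",                      # LoRAX does not require a real key
        base_url="http://127.0.0.1:8080/v1",  # OpenAI-compatible route
    )

    resp = client.chat.completions.create(
        model="my-org/mistral-7b-chat-lora",  # hypothetical adapter repo id
        messages=[
            {"role": "system", "content": "You are a terse assistant."},
            {"role": "user", "content": "Name one benefit of multi-LoRA serving."},
        ],
        max_tokens=64,
    )
    print(resp.choices[0].message.content)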

Maintenance & Community

  • Last commit: 2 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 1
  • Issues (30d): 1
  • Star history: 392 stars in the last 90 days

Licensing & Compatibility

  • License: Apache 2.0.
  • Free for commercial use.

Limitations & Caveats

  • Requires NVIDIA GPUs (Ampere generation or newer), CUDA 11.8+, and Linux, limiting hardware compatibility.
  • The project is a fork of text-generation-inference v0.9.4, so it depends on that version's behavior and does not automatically track later upstream features and fixes.

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 2 more.

S-LoRA by S-LoRA

Top 0.1% on sourcepulse · 2k stars
System for scalable LoRA adapter serving
created 1 year ago · updated 1 year ago
Starred by Carol Willing (Core Contributor to CPython, Jupyter), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 4 more.

dynamo by ai-dynamo

Top 1.1% on sourcepulse · 5k stars
Inference framework for distributed generative AI model serving
created 5 months ago · updated 21 hours ago