BLoRA by sabetAI

Inference optimization for batched LoRA adapters

Created 2 years ago
345 stars

Top 80.2% on SourcePulse

View on GitHub
Project Summary

This repository provides a method for batching multiple LoRA (Low-Rank Adaptation) adapters for simultaneous inference with a single base model. It targets users of large language models who want to leverage multiple specialized LoRAs without the overhead of loading separate models, thereby maximizing GPU utilization and inference throughput.

How It Works

BLoRA exploits the additive nature of LoRA updates, which are applied to specific layers of a base model. By broadcasting multiple LoRA adapters across the rows of a single batch, it routes each request through its own adapter while sharing the frozen base weights, enabling parallel inference across different adapter configurations. This avoids loading multiple model instances; because each adapter's weights are small, many adapters fit alongside one base model in VRAM.
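To make the mechanism concrete, below is a minimal sketch of the batched LoRA math, assuming adapter weights stacked into tensors and a per-row index tensor mapping each batch row to its adapter; the function and tensor names are illustrative, not BLoRA's actual API.

```python
import torch

def batched_lora_forward(x, W, A, B, adapter_ids, scaling=1.0):
    """Sketch of one batched LoRA linear layer (illustrative only).

    x           : (batch, seq, d_in)      input activations
    W           : (d_out, d_in)           shared frozen base weight
    A           : (n_adapters, r, d_in)   stacked LoRA down-projections
    B           : (n_adapters, d_out, r)  stacked LoRA up-projections
    adapter_ids : (batch,)                adapter index for each batch row
    """
    # Base projection is computed once and shared by every batch row.
    base = x @ W.T                                    # (batch, seq, d_out)

    # Gather each row's adapter, then add its low-rank update.
    A_b = A[adapter_ids]                              # (batch, r, d_in)
    B_b = B[adapter_ids]                              # (batch, d_out, r)
    down = torch.einsum("bsd,brd->bsr", x, A_b)       # x @ A_b^T
    up = torch.einsum("bsr,bor->bso", down, B_b)      # down @ B_b^T
    return base + scaling * up
```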

Quick Start & Requirements

  • Install via pip install -r requirements.txt after cloning the repository.
  • Requires Hugging Face Transformers and PEFT.
  • Example usage demonstrates loading a Llama base model and injecting multiple LoRA checkpoints; a sketch of that flow follows below.
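The repository's own example is not reproduced here; the snippet below is a hedged sketch of the setup the README describes, written with standard Hugging Face Transformers and PEFT calls. The model ID and adapter paths are placeholders, and BLoRA's actual helper functions may differ.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "path/to/llama-base"  # placeholder: any Llama checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16)

# Attach several LoRA checkpoints under distinct adapter names so they
# coexist on the single base model instead of each needing its own copy.
model = PeftModel.from_pretrained(model, "path/to/lora-a", adapter_name="lora-a")
model.load_adapter("path/to/lora-b", adapter_name="lora-b")
model.load_adapter("path/to/lora-c", adapter_name="lora-c")
```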

Highlighted Details

  • Enables simultaneous inference across multiple LoRA adapters on a single base model.
  • Maximizes GPU utilization by batching inference requests.
  • LoRA adapters are loaded and managed efficiently within VRAM.
  • Demonstrates a "hacky" method for side-loading LoRA batch IDs into the model for parallel processing (sketched after this list).
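The README does not document this mechanism in detail, so the following is only a guess at what "side-loading" batch IDs could look like: stashing a per-row adapter-index tensor on every LoRA submodule so a patched forward pass can route each batch row through its own adapter. The attribute name batch_lora_ids is hypothetical.

```python
import torch
from peft.tuners.lora import LoraLayer

def set_batch_lora_ids(model, adapter_ids: torch.Tensor):
    """Stash a (batch,) tensor of adapter indices on each LoRA layer.

    A patched LoRA forward would read `batch_lora_ids` to pick the right
    adapter per batch row. Purely illustrative of the "hacky" side-loading
    idea; not BLoRA's actual attribute or entry point.
    """
    for module in model.modules():
        if isinstance(module, LoraLayer):
            module.batch_lora_ids = adapter_ids

# Before each batched forward/generate call, e.g. rows 0..3 using
# adapters 0, 1, 2, 0 respectively:
# set_batch_lora_ids(model, torch.tensor([0, 1, 2, 0]))
```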

Maintenance & Community

  • Project appears to be a personal or small-team effort, with acknowledgments to @yacineMTB for review.
  • No explicit community channels (Discord/Slack) or roadmap are mentioned in the README.

Licensing & Compatibility

  • The README does not explicitly state a license. Without a license file, default copyright applies, so reuse terms are unclear.
  • Compatibility for commercial use or closed-source linking is not detailed.

Limitations & Caveats

The method for preparing batches relies on a "hacky" side-loading of LoRA identifiers, which suggests potential instability or future breaking changes. The README does not specify supported base models beyond Llama, nor does it provide performance benchmarks.

Health Check

  • Last Commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days

Starred by Jeff Hammerbacher (Cofounder of Cloudera), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 5 more.

Explore Similar Projects

punica by punica-ai

Top 0.2% on SourcePulse · 1k stars
LoRA serving system (research paper) for multi-tenant LLM inference
Created 2 years ago · Updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Johannes Hagemann (Cofounder of Prime Intellect), and 4 more.

S-LoRA by S-LoRA

Top 0.2% on SourcePulse · 2k stars
System for scalable LoRA adapter serving
Created 1 year ago · Updated 1 year ago
Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), Nikola Borisov (Founder and CEO of DeepInfra), and 3 more.

tensorrtllm_backend by triton-inference-server

Top 0.2% on SourcePulse · 889 stars
Triton backend for serving TensorRT-LLM models
Created 2 years ago · Updated 1 day ago
Starred by Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 8 more.

lorax by predibase

Top 0.2% on SourcePulse · 3k stars
Multi-LoRA inference server for serving 1000s of fine-tuned LLMs
Created 1 year ago · Updated 4 months ago