tabbyAPI by theroyallab

FastAPI application for LLM text generation using Exllamav2

Created 1 year ago
1,049 stars

Top 35.9% on SourcePulse

Project Summary

TabbyAPI provides an OpenAI-compatible API server for Exllamav2, enabling efficient text generation with large language models. It targets users who need a lightweight, fast, and flexible backend for LLM inference, particularly those familiar with Exllamav2 or seeking an alternative to more complex solutions. The primary benefit is a streamlined, OAI-standard interface for interacting with powerful LLM backends.
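
As a sketch of what OAI compatibility means in practice, the official openai Python client can be pointed at a running tabbyAPI instance. The base URL below assumes a local instance on tabbyAPI's default port of 5000; the API key and model name are placeholders to adjust for your deployment.

```python
from openai import OpenAI

# Placeholders: adjust base_url and api_key to match your deployment.
client = OpenAI(
    base_url="http://localhost:5000/v1",  # assumed local tabbyAPI endpoint
    api_key="YOUR_TABBY_API_KEY",
)

response = client.chat.completions.create(
    model="my-exl2-model",  # hypothetical; tabbyAPI serves whichever model is loaded
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    max_tokens=64,
)
print(response.choices[0].message.content)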

How It Works

TabbyAPI uses the Exllamav2 library as its core inference engine, known for its speed and efficiency, especially with Exl2-quantized models. FastAPI provides the asynchronous web framework, allowing concurrent inference requests via asyncio. Key architectural choices include paged attention on modern NVIDIA GPUs (Ampere and newer) for parallel batching, and a flexible Jinja2 templating engine for chat completions that adapts the server to each model's prompt format.
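
To make the templating concrete, here is a minimal sketch of rendering a chat prompt with Jinja2. The ChatML-style template is hypothetical and for illustration only; tabbyAPI loads its own prompt templates, typically the one associated with the loaded model.

```python
from jinja2 import Template

# Hypothetical ChatML-style template, for illustration only; not tabbyAPI's
# actual template. Each message is wrapped in role-tagged delimiters, and a
# final assistant header cues the model to generate its reply.
CHATML = (
    "{% for m in messages %}"
    "<|im_start|>{{ m.role }}\n{{ m.content }}<|im_end|>\n"
    "{% endfor %}"
    "<|im_start|>assistant\n"
)

prompt = Template(CHATML).render(messages=[
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Hello!"},
])
print(prompt)
```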

Quick Start & Requirements

  • Install: pip install -r requirements.txt (from source) or via Docker; a quick smoke test follows this list.
  • Prerequisites: NVIDIA GPU (Ampere or newer recommended for parallel batching), CUDA, Python 3.10+.
  • Resources: LLM model weights must be downloaded separately.
  • Docs: Wiki
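
Once the server is running with a model loaded, a minimal sanity check is to list models through the OAI-style endpoint. The host, port, and key are the same placeholders as in the earlier example.

```python
from openai import OpenAI

# Placeholder endpoint and key; adjust to your configured host, port, and key.
client = OpenAI(base_url="http://localhost:5000/v1", api_key="YOUR_TABBY_API_KEY")

# The OAI-style /v1/models route should report the currently loaded model.
for model in client.models.list().data:
    print(model.id)
```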

Highlighted Details

  • OpenAI-compatible API
  • Supports Exl2, GPTQ, and FP16 models via Exllamav2
  • Features speculative decoding, multi-LoRA, and continuous batching
  • Includes AI Horde integration and OAI-style tool/function calling (see the sketch after this list)
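
As a sketch of the OAI-style tool calling mentioned above, a standard tools array sent through the openai client should round-trip, assuming the loaded model's prompt template supports tool calls. The endpoint, key, model, and example function are all placeholders, not part of tabbyAPI itself.

```python
from openai import OpenAI

# Placeholders throughout; adjust endpoint, key, and model for your setup.
client = OpenAI(base_url="http://localhost:5000/v1", api_key="YOUR_TABBY_API_KEY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, for illustration
        "description": "Look up the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="my-exl2-model",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
```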

Maintenance & Community

  • Actively developed by kingbri, Splice86, and Turboderp.
  • Community support via the project's Discord server.

Licensing & Compatibility

  • License: MIT.
  • Compatibility: Suitable for commercial use and integration with closed-source applications.

Limitations & Caveats

This project is a hobbyist endeavor, not intended for production servers, and may undergo breaking changes requiring dependency reinstallation. It is specifically designed for Exllamav2 backends; GGUF models are handled by a sister project, YALS.

Health Check

  • Last Commit: 3 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 4
  • Issues (30d): 3
  • Star History: 20 stars in the last 30 days
