tabbyAPI by theroyallab

FastAPI application for LLM text generation using Exllamav2

Created 1 year ago
1,049 stars

Top 35.9% on SourcePulse

Project Summary

TabbyAPI provides an OpenAI-compatible API server for Exllamav2, enabling efficient text generation with large language models. It targets users who need a lightweight, fast, and flexible backend for LLM inference, particularly those familiar with Exllamav2 or seeking an alternative to more complex solutions. The primary benefit is a streamlined, OAI-standard interface for interacting with powerful LLM backends.
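
As a sketch of what OAI compatibility means in practice, the official openai Python client can be pointed at a running tabbyAPI instance. The base URL below assumes a local instance on tabbyAPI's default port of 5000; the API key and model name are placeholders to adjust for your deployment.

```python
from openai import OpenAI

# Placeholders: adjust base_url and api_key to match your deployment.
client = OpenAI(
    base_url="http://localhost:5000/v1",  # assumed local tabbyAPI endpoint
    api_key="YOUR_TABBY_API_KEY",
)

response = client.chat.completions.create(
    model="my-exl2-model",  # hypothetical; tabbyAPI serves whichever model is loaded
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    max_tokens=64,
)
print(response.choices[0].message.content)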

How It Works

TabbyAPI uses the Exllamav2 library as its core inference engine, known for its speed and efficiency, especially with Exl2-quantized models. FastAPI provides the asynchronous web framework, allowing concurrent inference requests via asyncio. Key architectural choices include paged attention on modern NVIDIA GPUs (Ampere and newer) for parallel batching, and a flexible Jinja2 templating engine for chat completions that adapts the server to each model's prompt format.
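
To make the templating concrete, here is a minimal sketch of rendering a chat prompt with Jinja2. The ChatML-style template is hypothetical and for illustration only; tabbyAPI loads its own prompt templates, typically the one associated with the loaded model.

```python
from jinja2 import Template

# Hypothetical ChatML-style template, for illustration only; not tabbyAPI's
# actual template. Each message is wrapped in role-tagged delimiters, and a
# final assistant header cues the model to generate its reply.
CHATML = (
    "{% for m in messages %}"
    "<|im_start|>{{ m.role }}\n{{ m.content }}<|im_end|>\n"
    "{% endfor %}"
    "<|im_start|>assistant\n"
)

prompt = Template(CHATML).render(messages=[
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Hello!"},
])
print(prompt)
```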

Quick Start & Requirements

  • Install: pip install -r requirements.txt (from source) or via Docker; a quick smoke test follows this list.
  • Prerequisites: NVIDIA GPU (Ampere or newer recommended for parallel batching), CUDA, Python 3.10+.
  • Resources: LLM model weights must be downloaded separately.
  • Docs: Wiki
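
Once the server is running with a model loaded, a minimal sanity check is to list models through the OAI-style endpoint. The host, port, and key are the same placeholders as in the earlier example.

```python
from openai import OpenAI

# Placeholder endpoint and key; adjust to your configured host, port, and key.
client = OpenAI(base_url="http://localhost:5000/v1", api_key="YOUR_TABBY_API_KEY")

# The OAI-style /v1/models route should report the currently loaded model.
for model in client.models.list().data:
    print(model.id)
```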

Highlighted Details

  • OpenAI-compatible API
  • Supports Exl2, GPTQ, and FP16 models via Exllamav2
  • Features speculative decoding, multi-LoRA, and continuous batching
  • Includes AI Horde integration and OAI-style tool/function calling (see the sketch after this list)
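
As a sketch of the OAI-style tool calling mentioned above, a standard tools array sent through the openai client should round-trip, assuming the loaded model's prompt template supports tool calls. The endpoint, key, model, and example function are all placeholders, not part of tabbyAPI itself.

```python
from openai import OpenAI

# Placeholders throughout; adjust endpoint, key, and model for your setup.
client = OpenAI(base_url="http://localhost:5000/v1", api_key="YOUR_TABBY_API_KEY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, for illustration
        "description": "Look up the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="my-exl2-model",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
```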

Maintenance & Community

  • Actively developed by kingbri, Splice86, and Turboderp.
  • Community support via the project's Discord server.

Licensing & Compatibility

  • License: MIT.
  • Compatibility: Suitable for commercial use and integration with closed-source applications.

Limitations & Caveats

This project is a hobbyist endeavor, not intended for production servers, and may undergo breaking changes requiring dependency reinstallation. It is specifically designed for Exllamav2 backends; GGUF models are handled by a sister project, YALS.

Health Check

  • Last Commit: 3 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 4
  • Issues (30d): 3
  • Star History: 20 stars in the last 30 days
