tabbyAPI by theroyallab

FastAPI application for LLM text generation using Exllamav2

Created 1 year ago · 1,016 stars · Top 37.5% on sourcepulse

Project Summary

TabbyAPI provides an OpenAI-compatible API server for Exllamav2, enabling efficient text generation with large language models. It targets users who need a lightweight, fast, and flexible backend for LLM inference, particularly those familiar with Exllamav2 or seeking an alternative to more complex solutions. The primary benefit is a streamlined, OAI-standard interface for interacting with powerful LLM backends.
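As a minimal sketch of what that compatibility means in practice, a stock OpenAI client can be pointed at a local TabbyAPI instance. The base URL, API key, and model name below are placeholders assuming default local settings; check the wiki for your actual values.

```python
# Minimal sketch: querying a local TabbyAPI instance with the official
# OpenAI Python SDK. base_url and api_key are assumptions -- substitute
# the host/port and key from your own TabbyAPI configuration.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:5000/v1",  # assumed default local address
    api_key="your-tabby-api-key",         # placeholder key
)

response = client.chat.completions.create(
    model="local-model",  # placeholder; the server uses whichever model it has loaded
    messages=[{"role": "user", "content": "Summarize what Exllamav2 does."}],
)
print(response.choices[0].message.content)
```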

How It Works

TabbyAPI uses the Exllamav2 library as its core inference engine, valued for its speed and efficiency, especially with Exl2 quantized models. FastAPI provides the asynchronous web framework, allowing concurrent inference requests via asyncio. Key architectural choices include paged attention on modern NVIDIA GPUs (Ampere and newer) for parallel batching, and a flexible Jinja2 templating engine for chat completions that adapts the server to different models' prompt formats.
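To make the templating idea concrete, here is a toy Jinja2 chat template rendered in Python. This is purely illustrative; the real templates ship with the model or the server configuration, not with this snippet.

```python
# Illustrative only: a toy Jinja2 chat template in the same spirit as the
# ones applied to chat completions. Real templates come from the model's
# tokenizer config or the server's template files.
from jinja2 import Template

chat_template = Template(
    "{% for m in messages %}"
    "<|{{ m.role }}|>\n{{ m.content }}\n"
    "{% endfor %}"
    "<|assistant|>\n"
)

prompt = chat_template.render(messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
])
print(prompt)  # the flattened prompt string handed to the inference engine
```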

Quick Start & Requirements

  • Install: pip install -r requirements.txt (from source) or via Docker.
  • Prerequisites: NVIDIA GPU (Ampere+ recommended for parallel batching), CUDA, Python 3.10+.
  • Resources: LLM model weights must be downloaded separately.
  • Docs: Wiki

Highlighted Details

  • OpenAI compatible API
  • Supports Exl2, GPTQ, and FP16 models via Exllamav2
  • Features speculative decoding, multi-LoRA, and continuous batching
  • Includes AI Horde integration and OAI-style tool/function calling (see the sketch after this list)
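Since the list above mentions OAI-style tool calling, here is a hedged sketch of that flow using the OpenAI SDK. The tool definition is an invented example, and base_url/api_key are placeholders; the exact schema TabbyAPI accepts is covered in its wiki.

```python
# Sketch of OAI-style tool/function calling against a local endpoint.
# get_weather is a hypothetical tool invented for illustration.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5000/v1", api_key="your-tabby-api-key")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical function name
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="local-model",  # placeholder
    messages=[{"role": "user", "content": "What's the weather in Oslo?"}],
    tools=tools,
)

# If the model chose to call the tool, the structured call appears here:
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```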

Maintenance & Community

  • Actively developed by kingbri, Splice86, and Turboderp.
  • Community support via the project's Discord server.

Licensing & Compatibility

  • License: MIT.
  • Compatibility: Suitable for commercial use and integration with closed-source applications.

Limitations & Caveats

This project is a hobbyist endeavor, not intended for production servers, and may undergo breaking changes requiring dependency reinstallation. It is specifically designed for Exllamav2 backends; GGUF models are handled by a sister project, YALS.

Health Check

  • Last commit: 1 day ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 5
  • Issues (30d): 6
  • Star History: 92 stars in the last 90 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Nat Friedman (former CEO of GitHub), and 32 more.

llama.cpp by ggml-org

C/C++ library for local LLM inference

Top 0.4% on sourcepulse · 84k stars
Created 2 years ago · updated 9 hours ago