FastAPI application for LLM text generation using Exllamav2
TabbyAPI provides an OpenAI-compatible API server for Exllamav2, enabling efficient text generation with large language models. It targets users who need a lightweight, fast, and flexible backend for LLM inference, particularly those familiar with Exllamav2 or seeking an alternative to more complex solutions. The primary benefit is a streamlined, OpenAI-standard interface to a fast Exllamav2 backend.
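As a sketch of what that OpenAI-compatible surface means in practice, the official OpenAI Python client can be pointed at a local TabbyAPI instance. The base URL, port, API key, and model name below are illustrative assumptions; adapt them to your own deployment.

```python
# Minimal sketch: talking to a local TabbyAPI server through the official
# OpenAI Python client (openai>=1.0). Host, port, key, and model name are
# assumptions for illustration, not TabbyAPI defaults you can rely on.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:5000/v1",  # assumed local bind address
    api_key="your-tabby-api-key",         # hypothetical key from your server config
)

response = client.chat.completions.create(
    model="my-exl2-model",                # hypothetical name of the loaded model
    messages=[{"role": "user", "content": "Summarize paged attention in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```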
How It Works
TabbyAPI uses the Exllamav2 library as its core inference engine; Exllamav2 is known for speed and efficiency, especially with Exl2-quantized models. The server is built on FastAPI, whose asyncio-based request handling allows concurrent inference requests. Key architectural choices include paged attention on modern NVIDIA GPUs (Ampere and newer) for parallel batching, and a flexible Jinja2 templating engine for chat completions, adapting the server to a range of LLM prompt formats.
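To illustrate the templating idea, a minimal Jinja2 chat template can render an OpenAI-style message list into a single model prompt. The ChatML-like template below is a hypothetical example, not TabbyAPI's bundled template.

```python
# Illustrative sketch of Jinja2 chat templating: an OAI-style message list
# is rendered into one prompt string. The tags and layout here are a
# ChatML-style assumption, not TabbyAPI's shipped template.
from jinja2 import Template

CHAT_TEMPLATE = Template(
    "{% for m in messages %}"
    "<|im_start|>{{ m.role }}\n{{ m.content }}<|im_end|>\n"
    "{% endfor %}"
    "<|im_start|>assistant\n"  # leave the assistant turn open for generation
)

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Hello!"},
]
print(CHAT_TEMPLATE.render(messages=messages))
```

Swapping the template swaps the prompt format, which is what makes a template-driven server adaptable across model families without code changes.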
Quick Start & Requirements
Install from source:

```
pip install -r requirements.txt
```

or run via Docker.
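Once the server is up, a quick smoke test is to list models over the OpenAI-compatible route. The bind address, port, and `x-api-key` header in this sketch are assumptions that depend on your configuration:

```python
# Hedged smoke test: query the OAI-compatible model listing endpoint.
# URL and auth header are assumptions; TabbyAPI deployments typically
# require an API key configured on the server side.
import requests

resp = requests.get(
    "http://localhost:5000/v1/models",            # assumed local bind address
    headers={"x-api-key": "your-tabby-api-key"},  # hypothetical key
)
print(resp.status_code, resp.json())
```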
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
This project is a hobbyist endeavor, not intended for production servers, and may undergo breaking changes requiring dependency reinstallation. It is specifically designed for Exllamav2 backends; GGUF models are handled by a sister project, YALS.