Kokoro-FastAPI by remsky

FastAPI wrapper for Kokoro-82M text-to-speech model

created 7 months ago
3,366 stars

Top 14.8% on sourcepulse

Project Summary

This project provides a Dockerized FastAPI wrapper for the Kokoro-82M text-to-speech model, offering an OpenAI-compatible API for generating speech. It targets developers and researchers needing a flexible TTS solution with support for both NVIDIA GPU and CPU inference, multi-language capabilities, and advanced features like voice mixing and per-word timestamps.

How It Works

The wrapper uses FastAPI to expose an HTTP API in front of the Kokoro-82M model. Inference runs on NVIDIA GPUs through PyTorch or on CPU; ONNX support for CPU inference is noted as upcoming. Long inputs are split automatically at sentence boundaries, synthesized piece by piece, and stitched back into a single audio output. Audio can also be streamed as it is generated.
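The split-and-stitch idea can be illustrated with a minimal Python sketch. This is not the project's actual code: the function names `split_into_chunks` and `stitch`, the `max_chars` limit, and the naive punctuation-based sentence rule are all assumptions made for illustration.

```python
import re


def split_into_chunks(text: str, max_chars: int = 200) -> list[str]:
    """Group whole sentences into chunks of at most max_chars where possible.

    Hypothetical helper: a single sentence longer than max_chars is kept
    whole rather than being cut mid-sentence.
    """
    # Naive rule: a sentence ends at ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def stitch(segments: list[bytes]) -> bytes:
    """Join raw PCM segments in order.

    Assumes every segment shares the same sample rate and sample format.
    """
    return b"".join(segments)
```

Each chunk would be synthesized separately and the resulting audio segments passed to `stitch` in order. Breaking only at sentence boundaries keeps the joins at natural pauses, which is presumably why the project splits there.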

Quick Start & Requirements

  • Install/Run: Docker is the primary method.
    • CPU: docker run -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-cpu:latest
    • GPU: docker run --gpus all -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-gpu:latest
  • Prerequisites: Docker, NVIDIA Container Toolkit (for GPU). Apple Silicon users must use the CPU setup. espeak-ng is recommended for fallback word handling.
  • Setup: Models auto-download. Docker Compose setup is available.
  • Docs: API documentation available at http://localhost:8880/docs.
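Once a container is running, a request to the speech endpoint might look like the sketch below. It uses only the Python standard library. The `/v1/audio/speech` path and the `input`, `voice`, and `response_format` fields follow the OpenAI speech API convention; the model name `kokoro` and the default voice are assumptions, so check the docs page at http://localhost:8880/docs for the values the server actually accepts.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8880"  # port taken from the docker run commands


def build_speech_request(text: str, voice: str = "af_bella",
                         response_format: str = "mp3") -> urllib.request.Request:
    """Build a POST request in the OpenAI speech-API shape (assumed layout)."""
    payload = {
        "model": "kokoro",  # assumed model identifier
        "input": text,
        "voice": voice,
        "response_format": response_format,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text: str, out_path: str = "speech.mp3") -> None:
    """Send the request and write the returned audio bytes to out_path."""
    with urllib.request.urlopen(build_speech_request(text)) as resp:
        audio = resp.read()
    with open(out_path, "wb") as f:
        f.write(audio)
```

Calling `synthesize("Hello from Kokoro.")` with the container running should leave an audio file at `speech.mp3`, provided the assumed field values match the server.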

Highlighted Details

  • OpenAI-compatible Speech endpoint.
  • Weighted voice combinations (e.g., af_bella(2)+af_sky(1)).
  • Per-word timestamped caption generation.
  • Streaming support with low time-to-first-audio latency (e.g., ~300 ms to first token on GPU).
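To make the weighted voice syntax concrete, the sketch below parses a string such as af_bella(2)+af_sky(1) into normalized mixing ratios. It is an illustration only: `parse_voice_mix` is a hypothetical helper, and the assumptions that weights are normalized to sum to 1 and that a voice without a weight counts as 1 may not match how the server handles them.

```python
import re

# One term: a voice name, optionally followed by a numeric weight in parentheses.
_TERM = re.compile(r"^\s*([A-Za-z0-9_]+)\s*(?:\(\s*([0-9]*\.?[0-9]+)\s*\))?\s*$")


def parse_voice_mix(spec: str) -> dict[str, float]:
    """Parse 'name(weight)+name(weight)' into ratios that sum to 1.

    Hypothetical helper; a missing weight is treated as 1 (an assumption).
    """
    weights: dict[str, float] = {}
    for term in spec.split("+"):
        match = _TERM.match(term)
        if match is None:
            raise ValueError(f"cannot parse voice term: {term!r}")
        name, raw = match.group(1), match.group(2)
        weights[name] = weights.get(name, 0.0) + (float(raw) if raw else 1.0)
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("voice weights must sum to a positive number")
    return {name: weight / total for name, weight in weights.items()}
```

Under these assumptions, `parse_voice_mix("af_bella(2)+af_sky(1)")` gives af_bella two thirds of the mix and af_sky one third.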

Maintenance & Community

This is a community-driven project. Support is available through contributions, bug reports, and feature requests. The release branch is for stable builds, while master is for active development.

Licensing & Compatibility

The project and the Kokoro-82M model weights are licensed under the Apache License 2.0. The inference code adapted from StyleTTS2 is MIT licensed. Both licenses permit commercial use and integration into closed-source projects.

Limitations & Caveats

The project is described as "development-focused," and users may need to troubleshoot or roll back versions if issues arise. GPU support is limited to NVIDIA hardware with CUDA; Apple Silicon (MPS) support is planned but not yet available. Text normalization can sometimes alter input phrases, though it can be disabled.

Health Check

  • Last commit: 4 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 8
  • Issues (30d): 12

Star History

884 stars in the last 90 days
