Kokoro-FastAPI by remsky

FastAPI wrapper for the Kokoro-82M text-to-speech model

Created 1 year ago

4,250 stars

Top 11.5% on SourcePulse

Project Summary

This project provides a Dockerized FastAPI wrapper for the Kokoro-82M text-to-speech model, offering an OpenAI-compatible API for generating speech. It targets developers and researchers needing a flexible TTS solution with support for both NVIDIA GPU and CPU inference, multi-language capabilities, and advanced features like voice mixing and per-word timestamps.

How It Works

The wrapper uses FastAPI to expose an HTTP API in front of the Kokoro-82M model. Inference runs on PyTorch with NVIDIA GPU acceleration or on CPU; ONNX-based CPU inference is listed, but ONNX support is noted as upcoming. Long texts are handled by splitting the input automatically at sentence boundaries, synthesizing each piece, and stitching the resulting audio back together. Streaming audio output is also supported.
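The split-and-stitch idea for long texts can be illustrated with a minimal sketch. This is not the project's actual code: the function names `split_into_chunks` and `stitch`, the regular expression, and the `max_chars` limit are all hypothetical, and the stitching assumes raw audio segments that can simply be concatenated.

```python
import re

# Hypothetical sentence-boundary pattern: whitespace that follows ., ! or ?
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_into_chunks(text: str, max_chars: int = 200) -> list[str]:
    """Pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole as its own chunk.
    """
    sentences = [s.strip() for s in _SENTENCE_END.split(text.strip()) if s.strip()]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow the chunk: close it and start anew.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def stitch(segments: list[bytes]) -> bytes:
    """Join per-chunk audio in order (assumes raw, header-less audio bytes)."""
    return b"".join(segments)
```

Each chunk would be synthesized separately and the resulting segments passed to `stitch` in order; because the cuts fall between sentences, the joins land on natural pauses.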

Quick Start & Requirements

Install/Run: Docker is the primary method.
- CPU: docker run -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-cpu:latest
- GPU: docker run --gpus all -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-gpu:latest
Prerequisites: Docker, NVIDIA Container Toolkit (for GPU). Apple Silicon users must use the CPU setup. espeak-ng is recommended for fallback word handling.
Setup: Models auto-download. Docker Compose setup is available.
Docs: API documentation available at http://localhost:8880/docs.
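Once a container is running, a request to the OpenAI-compatible endpoint might look like the sketch below. Several details are assumptions rather than facts from this summary: the path `/v1/audio/speech` and the JSON field names follow OpenAI's speech API, the model name `kokoro` is a guess, and the helper names are hypothetical. The interactive docs at http://localhost:8880/docs are the place to confirm the real schema.

```python
import json
import urllib.request

# Base URL taken from the Docker commands above (port 8880).
BASE_URL = "http://localhost:8880"


def build_speech_request(text, voice="af_bella", response_format="mp3",
                         base_url=BASE_URL):
    """Build (but do not send) a POST request for speech synthesis.

    Assumed: endpoint path and field names mirror OpenAI's speech API,
    and "kokoro" is accepted as the model name.
    """
    payload = {
        "model": "kokoro",
        "input": text,
        "voice": voice,
        "response_format": response_format,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text, path, **kwargs):
    """Send the request and write the returned audio bytes to path."""
    request = build_speech_request(text, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response, \
            open(path, "wb") as out:
        out.write(response.read())
```

With the server up, `synthesize_to_file("Hello from Kokoro.", "hello.mp3")` would save the audio locally. Separating request construction from sending keeps the payload easy to inspect without a running server.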

Highlighted Details

OpenAI-compatible Speech endpoint.
Weighted voice combinations (e.g., af_bella(2)+af_sky(1)).
Per-word timestamped caption generation.
Streaming support with low latency (e.g., ~300 ms to first token on GPU).
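To make the weighted voice syntax concrete, the sketch below parses a string such as `af_bella(2)+af_sky(1)` into normalized weights. Only the string format comes from this summary; the parser name `parse_voice_mix`, the default weight of 1 for a bare voice name, and the normalization to a sum of 1 are assumptions about how such a mix could be interpreted, not a description of the server's own logic.

```python
import re

# One term: a voice name, optionally followed by a numeric weight in parentheses.
_VOICE_TERM = re.compile(r"^\s*([A-Za-z0-9_]+)\s*(?:\(\s*(\d+(?:\.\d+)?)\s*\))?\s*$")


def parse_voice_mix(spec: str) -> dict[str, float]:
    """Parse 'name(weight)+name(weight)' into weights that sum to 1.0."""
    weights: dict[str, float] = {}
    for term in spec.split("+"):
        match = _VOICE_TERM.match(term)
        if match is None:
            raise ValueError(f"invalid voice term: {term!r}")
        name, raw_weight = match.group(1), match.group(2)
        # Assumption: a bare voice name counts with weight 1.
        weight = float(raw_weight) if raw_weight is not None else 1.0
        weights[name] = weights.get(name, 0.0) + weight
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("voice weights must sum to a positive value")
    return {name: weight / total for name, weight in weights.items()}
```

Under these assumptions, `af_bella(2)+af_sky(1)` corresponds to roughly two thirds `af_bella` and one third `af_sky`.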

Maintenance & Community

This is a community-driven project that relies on contributions, bug reports, and feature requests. The release branch holds stable builds, while master is used for active development.

Licensing & Compatibility

The project and the Kokoro-82M model weights are licensed under the Apache License 2.0. The inference code adapted from StyleTTS2 is MIT licensed. Both licenses are permissive and allow commercial use and integration into closed-source projects.

Limitations & Caveats

The project is described as "development-focused," and users may need to troubleshoot or roll back versions if issues arise. GPU support is limited to NVIDIA hardware with CUDA; Apple Silicon (MPS) support is planned but not yet available. Text normalization can sometimes alter input phrases, though it can be disabled.

Explore Similar Projects

orate by haydenbleasel

echogarden by echogarden-project

LLaSM by LinkSoul-AI

sesame_csm_openai by phildougherty

Scriberr by rishikanthc

tts by zuoban

ichigo by janhq

easyVoice by cosin2077

whisper-asr-webservice by ahmetoner

Zonos by Zyphra

piper by rhasspy

openai-fm by openai