FastAPI wrapper for Kokoro-82M text-to-speech model
This project provides a Dockerized FastAPI wrapper for the Kokoro-82M text-to-speech model, offering an OpenAI-compatible API for generating speech. It targets developers and researchers needing a flexible TTS solution with support for both NVIDIA GPU and CPU inference, multi-language capabilities, and advanced features like voice mixing and per-word timestamps.
How It Works
The wrapper leverages FastAPI to expose an API that interfaces with the Kokoro-82M model. It supports both PyTorch for NVIDIA GPU acceleration and ONNX for CPU inference (with ONNX support noted as upcoming). The architecture is designed for efficient inference, including features for handling long texts by automatically splitting and stitching audio at sentence boundaries, and supports streaming audio output.
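As a sketch of how a client might talk to the wrapper, the following assumes the server exposes an OpenAI-style /v1/audio/speech endpoint on port 8880 and accepts the usual OpenAI speech request fields (the endpoint path, model name, and field names here are assumptions based on the OpenAI audio API shape, not confirmed specifics of this project):

```python
# Minimal client sketch using only the standard library.
# Assumptions: OpenAI-compatible /v1/audio/speech endpoint, model name
# "kokoro", and voice id "af_bella" -- all hypothetical placeholders.
import json
import urllib.request

API_URL = "http://localhost:8880/v1/audio/speech"

def build_speech_request(text: str, voice: str = "af_bella", fmt: str = "mp3") -> dict:
    """Build an OpenAI-style speech request payload."""
    return {
        "model": "kokoro",           # assumed model identifier
        "voice": voice,
        "input": text,
        "response_format": fmt,
    }

def synthesize_to_file(text: str, path: str, voice: str = "af_bella") -> None:
    """POST the request and write the returned audio to a file in chunks."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_speech_request(text, voice)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp, open(path, "wb") as f:
        while chunk := resp.read(4096):  # read incrementally, suits streamed audio
            f.write(chunk)

if __name__ == "__main__":
    synthesize_to_file("Hello from Kokoro-82M.", "speech.mp3")
```

Reading the response in fixed-size chunks rather than all at once is what makes the streamed output useful for long texts that the server splits and stitches internally.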
Quick Start & Requirements
CPU: docker run -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-cpu:latest
GPU (NVIDIA): docker run --gpus all -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-gpu:latest
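Once a container is up, a quick smoke test can confirm the server is reachable by probing FastAPI's built-in Swagger UI at /docs (port 8880 as in the commands above):

```python
# Smoke test: returns True if the server answers on its /docs page.
# The base URL matches the docker port mapping shown above.
import urllib.request

def server_is_up(base: str = "http://localhost:8880") -> bool:
    """Return True if the FastAPI docs page responds with HTTP 200."""
    try:
        with urllib.request.urlopen(base + "/docs", timeout=5) as resp:
            return resp.status == 200
    except OSError:  # connection refused, timeout, DNS failure, etc.
        return False
```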
espeak-ng is recommended as a fallback for handling unknown words. Once the container is running, interactive API documentation is available at http://localhost:8880/docs.
Highlighted Details
Voices can be mixed with relative weights (e.g. af_bella(2)+af_sky(1)).
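A small helper can render such weighted mixes from a dictionary; this is a sketch assuming the server accepts "name(weight)" expressions joined with "+" in the voice field, as the example above suggests:

```python
# Hypothetical helper: turn {"af_bella": 2, "af_sky": 1} into the
# "af_bella(2)+af_sky(1)" mix syntax shown in the project's docs.
def mix_voices(weights: dict) -> str:
    """Render a voice-to-weight mapping as a weighted mix expression."""
    return "+".join(f"{name}({w})" for name, w in weights.items())
```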
Maintenance & Community
This is a community-driven project. Support is available through contributions, bug reports, and feature requests. The release branch is for stable builds, while master is for active development.
Licensing & Compatibility
The project and the Kokoro-82M model weights are licensed under Apache License 2.0. The inference code adapted from StyleTTS2 is MIT licensed. This is permissive for commercial use and integration into closed-source projects.
Limitations & Caveats
The project is described as "development-focused," and users may need to troubleshoot or roll back versions if issues arise. GPU support is limited to NVIDIA hardware with CUDA; Apple Silicon (MPS) support is planned but not yet available. Text normalization can sometimes alter input phrases, though it can be disabled.