Kitten-TTS-Server by devnen

Lightweight, high-performance Text-to-Speech server

Created 8 months ago
252 stars

Top 99.6% on SourcePulse

Project Summary

This project provides a self-hostable, high-performance API server and Web UI for the lightweight KittenTTS text-to-speech models. It addresses the need for an efficient, natural-sounding TTS solution that can run on diverse hardware, from powerful servers with NVIDIA GPUs to resource-constrained edge devices like the Raspberry Pi 5. Its main benefit is a user-friendly, production-ready TTS engine that improves on the base KittenTTS model with features such as GPU acceleration and large-text processing for audiobooks.

How It Works

The server leverages KittenTTS models, ranging from 15M to 80M parameters, running via ONNX for maximum portability. It utilizes a FastAPI backend and implements an optimized inference pipeline using onnxruntime-gpu and GPU I/O binding for NVIDIA GPUs, drastically reducing latency. For long texts, it intelligently splits them into manageable chunks, processes them sequentially, and seamlessly concatenates the resulting audio, making it suitable for audiobook generation. The approach prioritizes performance and efficiency, enabling real-time synthesis even on limited hardware.
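The long-text handling described above can be sketched as a sentence-aware chunker. This is a minimal illustration, not the server's actual implementation; the real pipeline's chunk size limit (`max_chars` below is an assumed value) and splitting heuristics may differ.

```python
import re

def chunk_text(text: str, max_chars: int = 300) -> list[str]:
    """Split text on sentence boundaries into chunks of at most max_chars.

    Simplified sketch of long-text chunking: sentences are greedily packed
    into chunks; each chunk is synthesized separately and the resulting
    audio segments are concatenated in order.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)   # flush the full chunk
            current = sentence       # start a new one
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be fed to the model sequentially, with the audio buffers joined end to end, which keeps per-inference memory bounded even for book-length input.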

Quick Start & Requirements

  • Primary install/run: Clone the repository, set up a Python virtual environment, install dependencies (pip install -r requirements.txt for CPU, pip install -r requirements-nvidia.txt for NVIDIA GPU), and run python server.py. Docker Compose (docker compose up -d --build or docker compose -f docker-compose-cpu.yml up -d --build) is also recommended for easier deployment.
  • Prerequisites: Python 3.10+, Git, eSpeak NG (essential for phonemization, requires separate installation and terminal restart on Windows), and for GPU acceleration: an NVIDIA GPU with CUDA support, onnxruntime-gpu, and PyTorch with CUDA 12.1. Linux/RPi requires libsndfile1 and ffmpeg.
  • Links: Repository: https://github.com/devnen/Kitten-TTS-Server.git, API Docs: http://localhost:8005/docs (once the server is running).

Highlighted Details

  • Full support for all 7 KittenTTS models (Nano, Micro, Mini; v0.1/v0.2 and v0.8) with hot-swappable model switching directly from the Web UI.
  • Features up to 8 named voices per model (e.g., Bella, Jasper, Luna), with automatic voice list updates upon model switching.
  • Robust large text processing and audiobook generation capabilities, handling long inputs seamlessly.
  • Offers both a primary /tts endpoint for full control and an OpenAI-compatible /v1/audio/speech endpoint for easy integration.
  • Optimized for edge devices like Raspberry Pi 5, providing real-time performance on resource-constrained hardware.
  • True NVIDIA GPU acceleration via an optimized ONNX Runtime pipeline with GPU I/O binding for minimal latency.
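As a sketch of how a client might call the primary /tts endpoint using only the standard library: the base URL and port come from the README, but the request field names (text, voice) are assumptions; consult http://localhost:8005/docs for the actual schema.

```python
import json
import urllib.request

API_BASE = "http://localhost:8005"  # default address from the README

def build_payload(text: str, voice: str = "Bella") -> dict:
    # Field names are illustrative; check the server's /docs for the real schema.
    return {"text": text, "voice": voice}

def synthesize(text: str, voice: str = "Bella") -> bytes:
    """POST to the primary /tts endpoint and return the raw audio bytes."""
    req = urllib.request.Request(
        f"{API_BASE}/tts",
        data=json.dumps(build_payload(text, voice)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

For existing OpenAI-based integrations, the same pattern applies against the /v1/audio/speech endpoint, which follows the OpenAI speech request shape (model, input, voice).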

Maintenance & Community

The README does not detail specific maintainers, sponsorships, or community channels like Discord or Slack. Contributions are welcomed via GitHub issues and pull requests.

Licensing & Compatibility

  • License: MIT License.
  • Compatibility: The permissive MIT license allows for broad compatibility, including commercial use and integration into closed-source applications.

Limitations & Caveats

GPU acceleration is strictly limited to NVIDIA hardware with CUDA. Installing eSpeak NG is a common point of failure if not done correctly, particularly on Windows, where a terminal restart is required after installation. Compiling certain Python packages during installation on ARM architectures (such as the Raspberry Pi) can take 15-30 minutes.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 2
  • Issues (30d): 1
  • Star History: 9 stars in the last 30 days

