Kitten-TTS-Server by devnen

Lightweight, high-performance Text-to-Speech server

Created 8 months ago
252 stars

Top 99.6% on SourcePulse

Project Summary

This project provides a self-hostable, high-performance API server and Web UI for the lightweight KittenTTS text-to-speech models. It addresses the need for an efficient, natural-sounding TTS solution that can run on diverse hardware, from powerful servers with NVIDIA GPUs to resource-constrained edge devices like the Raspberry Pi 5. Its main benefit is a user-friendly, production-ready TTS engine that improves on the base KittenTTS model with features such as GPU acceleration and large-text processing for audiobooks.

How It Works

The server leverages KittenTTS models, ranging from 15M to 80M parameters, running via ONNX for maximum portability. It utilizes a FastAPI backend and implements an optimized inference pipeline using onnxruntime-gpu and GPU I/O binding for NVIDIA GPUs, drastically reducing latency. For long texts, it intelligently splits them into manageable chunks, processes them sequentially, and seamlessly concatenates the resulting audio, making it suitable for audiobook generation. The approach prioritizes performance and efficiency, enabling real-time synthesis even on limited hardware.
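The long-text handling described above can be sketched as a sentence-aware chunker. This is a minimal illustration, not the server's actual implementation; the real pipeline's chunk size limit (`max_chars` below is an assumed value) and splitting heuristics may differ.

```python
import re

def chunk_text(text: str, max_chars: int = 300) -> list[str]:
    """Split text on sentence boundaries into chunks of at most max_chars.

    Simplified sketch of long-text chunking: sentences are greedily packed
    into chunks; each chunk is synthesized separately and the resulting
    audio segments are concatenated in order.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)   # flush the full chunk
            current = sentence       # start a new one
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be fed to the model sequentially, with the audio buffers joined end to end, which keeps per-inference memory bounded even for book-length input.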

Quick Start & Requirements

  • Primary install/run: Clone the repository, set up a Python virtual environment, install dependencies (pip install -r requirements.txt for CPU, pip install -r requirements-nvidia.txt for NVIDIA GPU), and run python server.py. Docker Compose (docker compose up -d --build or docker compose -f docker-compose-cpu.yml up -d --build) is also recommended for easier deployment.
  • Prerequisites: Python 3.10+, Git, eSpeak NG (essential for phonemization, requires separate installation and terminal restart on Windows), and for GPU acceleration: an NVIDIA GPU with CUDA support, onnxruntime-gpu, and PyTorch with CUDA 12.1. Linux/RPi requires libsndfile1 and ffmpeg.
  • Links: Repository: https://github.com/devnen/Kitten-TTS-Server.git, API Docs: http://localhost:8005/docs (once the server is running).

Highlighted Details

  • Full support for all 7 KittenTTS models (Nano, Micro, Mini; v0.1/v0.2 and v0.8) with hot-swappable model switching directly from the Web UI.
  • Features up to 8 named voices per model (e.g., Bella, Jasper, Luna), with automatic voice list updates upon model switching.
  • Robust large text processing and audiobook generation capabilities, handling long inputs seamlessly.
  • Offers both a primary /tts endpoint for full control and an OpenAI-compatible /v1/audio/speech endpoint for easy integration.
  • Optimized for edge devices like Raspberry Pi 5, providing real-time performance on resource-constrained hardware.
  • True NVIDIA GPU acceleration via an optimized ONNX Runtime pipeline with GPU I/O binding for minimal latency.
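As a sketch of how a client might call the primary /tts endpoint using only the standard library: the base URL and port come from the README, but the request field names (text, voice) are assumptions; consult http://localhost:8005/docs for the actual schema.

```python
import json
import urllib.request

API_BASE = "http://localhost:8005"  # default address from the README

def build_payload(text: str, voice: str = "Bella") -> dict:
    # Field names are illustrative; check the server's /docs for the real schema.
    return {"text": text, "voice": voice}

def synthesize(text: str, voice: str = "Bella") -> bytes:
    """POST to the primary /tts endpoint and return the raw audio bytes."""
    req = urllib.request.Request(
        f"{API_BASE}/tts",
        data=json.dumps(build_payload(text, voice)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

For existing OpenAI-based integrations, the same pattern applies against the /v1/audio/speech endpoint, which follows the OpenAI speech request shape (model, input, voice).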

Maintenance & Community

The README does not detail specific maintainers, sponsorships, or community channels like Discord or Slack. Contributions are welcomed via GitHub issues and pull requests.

Licensing & Compatibility

  • License: MIT License.
  • Compatibility: The permissive MIT license allows for broad compatibility, including commercial use and integration into closed-source applications.

Limitations & Caveats

GPU acceleration is strictly limited to NVIDIA hardware with CUDA. Installing eSpeak NG is a common point of failure if not done correctly, particularly on Windows, where a terminal restart is required after installation. Compiling certain Python packages during installation on ARM architectures (such as the Raspberry Pi) can take 15-30 minutes.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 2
  • Issues (30d): 1
  • Star History: 9 stars in the last 30 days

