kitten_tts_rs by second-state

Lightweight, high-quality text-to-speech in Rust

Created 2 weeks ago · 277 stars · Top 93.6% on SourcePulse
Project Summary

This project provides a high-performance, lightweight Text-to-Speech (TTS) system implemented in Rust, offering a self-contained alternative to Python-based solutions like the original KittenTTS. It targets developers building AI-agent skills or real-time audio applications, as well as those deploying TTS on resource-constrained devices, delivering high-quality voice synthesis with minimal overhead and fast startup times.

How It Works

The core of kitten_tts_rs is a Rust port of the KittenTTS models, leveraging ONNX for CPU-optimized inference. It processes input text through normalization, phonemization (using espeak-ng), and token encoding before feeding it into the ONNX runtime. The implementation provides two distinct binaries: a command-line interface (CLI) for direct audio generation and an OpenAI-compatible API server for integration into applications. This Rust-native approach eliminates Python dependencies, drastically reducing binary size and improving startup performance. Optional GPU acceleration via Cargo features (CUDA, TensorRT, CoreML, DirectML) is also supported.
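The text front end described above can be sketched in Rust as three stages: normalize, phonemize, and encode to token ids. The function names and the vocabulary table below are illustrative assumptions, not the crate's actual API; the real pipeline drives espeak-ng for phonemization and feeds the resulting ids into an ONNX session.

```rust
/// Collapse whitespace and lowercase, as a stand-in for full text normalization.
fn normalize(text: &str) -> String {
    text.split_whitespace()
        .collect::<Vec<_>>()
        .join(" ")
        .to_lowercase()
}

/// Placeholder phonemizer: one "phoneme" per character.
/// The real pipeline calls espeak-ng to produce IPA phonemes.
fn phonemize(text: &str) -> Vec<String> {
    text.chars()
        .filter(|c| !c.is_whitespace())
        .map(|c| c.to_string())
        .collect()
}

/// Map each phoneme to its index in a (hypothetical) vocabulary table,
/// skipping anything out of vocabulary.
fn encode(phonemes: &[String], vocab: &[&str]) -> Vec<usize> {
    phonemes
        .iter()
        .filter_map(|p| vocab.iter().position(|v| *v == p.as_str()))
        .collect()
}

fn main() {
    let vocab = ["h", "i", " "];
    let ids = encode(&phonemize(&normalize("  Hi  ")), &vocab);
    // These ids are what would be handed to the ONNX runtime for inference.
    assert_eq!(ids, vec![0, 1]);
    println!("token ids: {:?}", ids);
}
```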

Quick Start & Requirements

Installation involves downloading pre-built binaries and model weights from the project's releases and Hugging Face, respectively. A system-level installation of espeak-ng is required for phonemization. The core binary is approximately 10MB, with model weights ranging from 25MB to 80MB. Official quick-start instructions and download links are provided within the README.
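A first run might look like the following. The binary name, flag names, and file names here are assumptions for illustration only; consult the README for the actual quick-start commands.

```shell
# espeak-ng must be installed system-wide for phonemization:
command -v espeak-ng >/dev/null || echo "espeak-ng not found: install it via your package manager"

# Then (hypothetically) synthesize speech with the downloaded binary and weights:
# ./kitten_tts --model kitten_tts.onnx --text "Hello from Rust" --output hello.wav
```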

Highlighted Details

  • Ultra-lightweight models (15M-80M parameters, 25-80MB disk) optimized for CPU inference.
  • Provides both a CLI tool and an OpenAI-compatible API server with SSE streaming.
  • Features 8 built-in voices, adjustable speech speed, and text preprocessing.
  • Achieves fast startup times (~100ms) and a tiny binary footprint (~10MB).
  • Supports optional GPU acceleration via Cargo features for CUDA, TensorRT, CoreML, and DirectML.
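Since the server is OpenAI-compatible, a synthesis request would presumably follow the shape of OpenAI's speech endpoint (`POST /v1/audio/speech`). The field values below, including the model and voice names, are illustrative assumptions rather than the project's documented identifiers:

```json
{
  "model": "kitten-tts",
  "input": "Hello from kitten_tts_rs!",
  "voice": "voice-1",
  "speed": 1.0
}
```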

Maintenance & Community

The project acknowledges contributions from KittenML, pyke/ort, and espeak-ng. Specific details regarding active maintainers, community channels (like Discord or Slack), or a public roadmap are not detailed in the provided README.

Licensing & Compatibility

The project is licensed under the Apache-2.0 license, consistent with the original KittenTTS. This permissive license allows for commercial use and integration into closed-source applications without significant restrictions.

Limitations & Caveats

While CoreML acceleration is available for Apple Silicon, benchmarks indicate it can be slower than CPU-only inference for smaller KittenTTS models and has limitations with dynamic tensor shapes. The AAC audio format is not yet supported. GPU acceleration requires specific build features and corresponding system-level SDKs.

Health Check

  • Last commit: 1 week ago
  • Responsiveness: Inactive
  • Pull requests (30d): 2
  • Issues (30d): 2
  • Star history: 277 stars in the last 17 days
