RealtimeTTS by KoljaB

Realtime TTS library for low-latency text-to-speech conversion

Created 2 years ago

3,708 stars

Top 12.9% on SourcePulse

2 Experts Love This Project

Chip Huyen

Author of "AI Engineering", "Designing Machine Learning Systems"

Travis Fischer

Founder of Agentic

Project Summary

RealtimeTTS is a Python library for low-latency, high-quality text-to-speech (TTS) conversion, targeting developers building real-time applications, LLM integrations, and voice assistants. It offers near-instantaneous speech generation from text streams, supporting numerous TTS engines for flexibility and robustness.

How It Works

The library processes text input, splitting it into sentences using NLTK or Stanza tokenizers. It then feeds these sentences to a configurable TTS engine (e.g., OpenAI, ElevenLabs, Coqui, Piper) for synthesis. An audio stream manager handles playback, with options for asynchronous or synchronous operation, and includes features like silence insertion between sentences and callback hooks for monitoring progress. Its key advantage is the ability to switch between multiple TTS engines, providing a fallback mechanism for continuous operation.
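The sentence-by-sentence flow described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the library's implementation: the regular expression is a naive stand-in for the NLTK or Stanza tokenizers, and `iter_sentences`, `speak_stream`, `synthesize`, and `on_sentence` are hypothetical names chosen for this example.

```python
import re
from typing import Callable, Iterable, Iterator, Optional

# Naive sentence boundary: '.', '!' or '?' followed by whitespace.
# Assumption: a simple stand-in for a real tokenizer such as NLTK or Stanza.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def iter_sentences(chunks: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a stream of text chunks."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last one ends at a sentence boundary.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        # The last part may still be incomplete, so keep it buffered.
        buffer = parts[-1]
    # Flush whatever remains once the stream ends.
    if buffer.strip():
        yield buffer.strip()


def speak_stream(
    chunks: Iterable[str],
    synthesize: Callable[[str], bytes],
    on_sentence: Optional[Callable[[str], None]] = None,
) -> list:
    """Synthesize each sentence as it completes; `synthesize` is a hypothetical engine call."""
    audio = []
    for sentence in iter_sentences(chunks):
        if on_sentence is not None:
            on_sentence(sentence)  # callback hook for monitoring progress
        audio.append(synthesize(sentence))
    return audio
```

Because each sentence is handed to the engine as soon as its boundary is seen, synthesis can begin before the text source (for example, an LLM emitting tokens) has finished producing output.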

Quick Start & Requirements

Install: pip install -U realtimetts[all] installs support for every engine (recommended for full functionality); pip install realtimetts[engine_name] installs a single engine, where engine_name is a placeholder for that engine's extra.
Prerequisites: Python >= 3.9, < 3.13. Specific engines may require API keys (OpenAI, ElevenLabs, Azure), external installations (ffmpeg, mpv), or significant VRAM (Coqui TTS requires 4-5 GB VRAM for real-time inference). CUDA support is recommended for performance-intensive engines.
Setup: Basic setup is a single pip install. Engine-specific steps (obtaining API keys, installing external dependencies) take additional time.
Docs: RealtimeTTS Documentation
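Before installing, it can help to confirm that the environment meets the prerequisites listed above. The following is a small preflight sketch using only the standard library; `python_supported` and `missing_tools` are hypothetical helper names, and whether ffmpeg or mpv is actually needed depends on the engine you choose.

```python
import shutil
import sys


def python_supported(version=None) -> bool:
    """Return True if the version falls in the stated range: >= 3.9 and < 3.13."""
    major_minor = tuple((version or sys.version_info)[:2])
    return (3, 9) <= major_minor < (3, 13)


def missing_tools(tools=("ffmpeg", "mpv")) -> list:
    """Return the names of external tools that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    if not python_supported():
        print("Unsupported Python version:", sys.version.split()[0])
    for tool in missing_tools():
        print("Not found on PATH (only required by some engines):", tool)
```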

Highlighted Details

Supports a wide array of TTS engines including OpenAI, ElevenLabs, Azure, Coqui TTS, Piper, and gTTS.
Features a robust fallback mechanism to switch between engines if one fails.
Offers fine-grained control over audio playback, including silence durations and buffer thresholds.
Includes example scripts for LLM integration, voice interfaces, and real-time translation.
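The fallback behaviour mentioned in the list above can be illustrated with a short sketch: try each engine in priority order and move on when one raises. This shows the general pattern only and makes no claim about how RealtimeTTS implements it; `Engine`, `AllEnginesFailed`, and `synthesize_with_fallback` are hypothetical names, and each engine is modelled as a plain callable that turns text into audio bytes.

```python
from typing import Callable, Sequence, Tuple

# Assumption: an engine is modelled as a callable from text to audio bytes.
Engine = Callable[[str], bytes]


class AllEnginesFailed(RuntimeError):
    """Raised when every configured engine has failed for a piece of text."""


def synthesize_with_fallback(
    text: str, engines: Sequence[Tuple[str, Engine]]
) -> Tuple[str, bytes]:
    """Try engines in priority order; return the name and audio of the first that succeeds."""
    errors = []
    for name, engine in engines:
        try:
            return name, engine(text)
        except Exception as exc:  # e.g. network error, quota exceeded, missing model
            errors.append(f"{name}: {exc}")
    raise AllEnginesFailed("; ".join(errors))
```

A typical ordering would put a cloud engine first and a local engine last, so that speech continues (possibly at lower quality) when the network or an API quota fails.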

Maintenance & Community

The project is actively maintained by Kolja Beigel. Links to community resources like Discord or Slack are not explicitly provided in the README.

Licensing & Compatibility

The library itself is open-source. However, the usage of many integrated TTS engines (e.g., Coqui, ElevenLabs, Azure, OpenAI) has restrictions, particularly for commercial use, often requiring paid plans. System TTS (Mozilla Public License 2.0/LGPL 3.0) and gTTS (MIT) are more permissive. Users must consult individual engine licenses for commercial compatibility.

Limitations & Caveats

The README notes that some engines, like ParlerEngine, require specific, potentially complex installations (e.g., flash attention, specific PyTorch/CUDA versions) and may only run in near real-time on high-end GPUs (e.g., RTX 4090). Commercial use is heavily dependent on the chosen TTS engine's licensing terms.

Health Check

Last Commit

5 months ago

Responsiveness

1 day

Star History

46 stars in the last 30 days