tortoise-tts by neonbjb

Multi-voice TTS system emphasizing quality, realistic prosody

Created 4 years ago
14,832 stars

Top 3.5% on SourcePulse

Project Summary

Tortoise TTS is a high-quality, multi-voice text-to-speech system designed for realistic prosody and intonation. It targets researchers and developers needing advanced TTS capabilities, offering a significant improvement in naturalness over standard TTS models.

How It Works

Tortoise TTS employs a dual-decoder architecture, combining an autoregressive decoder with a diffusion decoder. This approach allows for highly detailed and natural-sounding speech generation, capturing nuances in intonation and prosody. The model is trained for quality, prioritizing realistic voice output.

Quick Start & Requirements

  • Install: pip install tortoise-tts or pip install git+https://github.com/neonbjb/tortoise-tts
  • Prerequisites: NVIDIA GPU (CUDA 11.7+ recommended), Python 3.9+. Conda installation is highly recommended for Windows to manage dependencies. Apple Silicon requires PyTorch nightly builds.
  • Setup: Local installation involves cloning the repo, setting up a Conda environment, installing PyTorch, and running python setup.py install. Docker is also provided.
  • Docs: Manuscript, Hugging Face Space
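
Once the package is installed, a programmatic call might look like the sketch below. It is not taken from this summary. The names `TextToSpeech`, `load_voice`, and `tts_with_preset`, the bundled voice `tom`, the preset list, and the 24 kHz output rate are assumptions based on the upstream README and should be checked against the installed version. The `validate_preset` helper and the `synthesize` wrapper are hypothetical names introduced only for this illustration.

```python
# Preset names are an assumption (taken from the upstream README);
# verify them against the version you actually install.
KNOWN_PRESETS = ("ultra_fast", "fast", "standard", "high_quality")


def validate_preset(preset: str) -> str:
    """Hypothetical helper: reject preset names this sketch does not know."""
    if preset not in KNOWN_PRESETS:
        raise ValueError(
            f"unknown preset {preset!r}; expected one of {KNOWN_PRESETS}"
        )
    return preset


def synthesize(text: str, voice: str = "tom", preset: str = "fast",
               out_path: str = "generated.wav") -> str:
    """Hypothetical wrapper: synthesize `text` and write a WAV file."""
    validate_preset(preset)

    # Imported lazily so this module can be loaded (and the helper above
    # used) on machines without tortoise-tts or a GPU.
    import torchaudio
    from tortoise.api import TextToSpeech
    from tortoise.utils.audio import load_voice

    tts = TextToSpeech()  # assumed to fetch model weights on first use
    voice_samples, conditioning_latents = load_voice(voice)
    audio = tts.tts_with_preset(
        text,
        voice_samples=voice_samples,
        conditioning_latents=conditioning_latents,
        preset=preset,
    )
    # Assumption: the model emits 24 kHz audio with a leading batch dimension.
    torchaudio.save(out_path, audio.squeeze(0).cpu(), 24000)
    return out_path
```

Calling `synthesize("Hello from Tortoise.", preset="ultra_fast")` would then write `generated.wav` to the working directory, provided the prerequisites listed above are met.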

Highlighted Details

  • Reports a real-time factor (RTF) of 0.25-0.3 on 4 GB of VRAM, with streaming latency under 500 ms.
  • Supports multiple voices and programmatic API usage.
  • Offers presets for faster inference (fast, ultra_fast).
  • Includes tools for batch processing text files (read.py, read_fast.py).
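
To make the RTF figure above concrete, the sketch below works through the arithmetic. It assumes the common definition of real-time factor (processing time divided by the duration of the audio produced), which this summary does not state explicitly. The function names are hypothetical.

```python
def synthesis_seconds(audio_seconds: float, rtf: float) -> float:
    """Estimate processing time, assuming RTF = processing time / audio duration."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return audio_seconds * rtf


def is_faster_than_realtime(rtf: float) -> bool:
    """An RTF below 1.0 means audio is produced faster than it plays back."""
    return rtf < 1.0


if __name__ == "__main__":
    clip = 10.0  # seconds of speech to generate
    for rtf in (0.25, 0.3):
        estimate = synthesis_seconds(clip, rtf)
        print(f"RTF {rtf}: about {estimate:.1f} s to synthesize {clip:.0f} s of audio")
```

Under that definition, a 10-second clip at the quoted RTF would take roughly 2.5 to 3 seconds to generate. The sub-500 ms streaming latency is a separate measure (time until the first audio chunk is available) and cannot be derived from RTF alone.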

Maintenance & Community

  • Developed by James Betker; employer not involved.
  • Activity appears to have slowed: the last commit was about a year ago and responsiveness is rated inactive.

Licensing & Compatibility

  • Licensed under Apache 2.0.
  • Permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

The model is noted as "insanely slow" in its initial description, though later updates claim significant speed improvements. CPU-only inference is not supported for the Hugging Face demo. DeepSpeed is disabled on Apple Silicon.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 2
  • Star history: 47 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Travis Fischer (founder of Agentic).

RealtimeTTS by KoljaB

0.4%
4k
Realtime TTS library for low-latency text-to-speech conversion
Created 2 years ago
Updated 2 days ago