tortoise-tts  by neonbjb

Multi-voice TTS system emphasizing quality, realistic prosody

created 3 years ago
14,464 stars

Top 3.5% on sourcepulse

Project Summary

Tortoise TTS is a high-quality, multi-voice text-to-speech system designed for realistic prosody and intonation. It targets researchers and developers needing advanced TTS capabilities, offering a significant improvement in naturalness over standard TTS models.

How It Works

Tortoise TTS employs a dual-decoder architecture, combining an autoregressive decoder with a diffusion decoder. This approach allows for highly detailed and natural-sounding speech generation, capturing nuances in intonation and prosody. The model is trained for quality, prioritizing realistic voice output.
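
To make the two-stage data flow concrete, here is a deliberately tiny sketch. It is not the Tortoise model: both functions are hypothetical stand-ins built on simple arithmetic, and the names, vocabulary size, and step count are assumptions chosen for illustration. It only shows the shape of the idea, where an autoregressive stage emits codes one at a time and a diffusion-style stage then starts from noise and refines a signal conditioned on those codes.

```python
import random

VOCAB_SIZE = 16  # assumed size of the toy code vocabulary


def autoregressive_decode(text_tokens, length):
    """Toy autoregressive stage: each new code depends on the text
    and on every code emitted so far."""
    codes = []
    for step in range(length):
        context = sum(text_tokens) + sum(codes) + step
        codes.append(context % VOCAB_SIZE)
    return codes


def diffusion_refine(codes, steps=20, seed=0):
    """Toy diffusion-style stage: start from pure noise and move
    step by step toward a signal conditioned on the codes."""
    rng = random.Random(seed)
    target = [c / (VOCAB_SIZE - 1) for c in codes]  # stand-in conditioning signal
    signal = [rng.gauss(0.0, 1.0) for _ in codes]   # initial noise
    for _ in range(steps):
        # Each step removes half of the remaining distance to the target.
        signal = [s + 0.5 * (t - s) for s, t in zip(signal, target)]
    return signal


text_tokens = [ord(ch) for ch in "hello"]
codes = autoregressive_decode(text_tokens, length=8)
frames = diffusion_refine(codes)
print(codes)
print([round(f, 3) for f in frames])
```

In the real system both stages are neural networks; the sketch keeps only the order of operations, with sequential code generation first and iterative refinement second.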

Quick Start & Requirements

  • Install: pip install tortoise-tts or pip install git+https://github.com/neonbjb/tortoise-tts
  • Prerequisites: NVIDIA GPU (CUDA 11.7+ recommended), Python 3.9+. Conda installation is highly recommended for Windows to manage dependencies. Apple Silicon requires PyTorch nightly builds.
  • Setup: Local installation involves cloning the repo, setting up a Conda environment, installing PyTorch, and running python setup.py install. Docker is also provided.
  • Docs: Manuscript, Hugging Face Space
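
Once installed, a minimal programmatic call might look like the sketch below. It is based on the usage pattern shown in the upstream README and example notebook, so treat the details as assumptions to verify against the installed version: the helper name synthesize is hypothetical, "tom" is assumed to be one of the bundled voices, 24000 Hz is assumed to be the output sample rate, and the "standard" and "high_quality" preset names are assumed to sit alongside the fast and ultra_fast presets.

```python
# Preset names assumed to be accepted by tts_with_preset; fastest first.
PRESETS = ("ultra_fast", "fast", "standard", "high_quality")


def synthesize(text, voice="tom", preset="fast", out_path="generated.wav"):
    """Generate speech for `text` and write it to `out_path` as a WAV file."""
    if preset not in PRESETS:
        raise ValueError(f"unknown preset {preset!r}; expected one of {PRESETS}")

    # Imported lazily so this module still loads where tortoise-tts is absent.
    import torchaudio
    from tortoise.api import TextToSpeech
    from tortoise.utils.audio import load_voice

    tts = TextToSpeech()  # model weights are fetched on first use
    voice_samples, conditioning_latents = load_voice(voice)
    audio = tts.tts_with_preset(
        text,
        voice_samples=voice_samples,
        conditioning_latents=conditioning_latents,
        preset=preset,
    )
    # Assumed output sample rate of 24 kHz.
    torchaudio.save(out_path, audio.squeeze(0).cpu(), 24000)
    return out_path
```

Calling synthesize("Hello there.", preset="ultra_fast") should write generated.wav to the working directory, provided a supported GPU and the model weights are available.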

Highlighted Details

  • Achieves a real-time factor (RTF) of 0.25-0.3 on 4 GB of VRAM; with streaming, latency stays under 500 ms.
  • Supports multiple voices and programmatic API usage.
  • Offers presets for faster inference (fast, ultra_fast).
  • Includes tools for batch processing text files (read.py, read_fast.py).
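
As a worked example of the RTF figure above, the sketch below assumes the conventional definition, where RTF is compute time divided by the duration of the audio produced, so values below 1 mean faster than real time. The function names are illustrative.

```python
def real_time_factor(compute_seconds, audio_seconds):
    """RTF = time spent generating / duration of the generated audio."""
    return compute_seconds / audio_seconds


def synthesis_seconds(audio_seconds, rtf):
    """Compute time needed to generate `audio_seconds` of speech at a given RTF."""
    return audio_seconds * rtf


for rtf in (0.25, 0.3):
    print(f"RTF {rtf}: 10 s of audio takes about {synthesis_seconds(10, rtf):.1f} s")
# RTF 0.25: 10 s of audio takes about 2.5 s
# RTF 0.3: 10 s of audio takes about 3.0 s
```

Under that definition, the quoted range means speech is generated roughly three to four times faster than it plays back.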

Maintenance & Community

  • Developed by James Betker as an independent project; his employer was not involved.
  • Activity signals are mixed: the last commit was 8 months ago, but pull requests and issues were still being opened in the last 30 days.

Licensing & Compatibility

  • Licensed under Apache 2.0.
  • Permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

The original project description calls the model "insanely slow", though later updates claim significant speed improvements. The Hugging Face demo does not support CPU-only inference. DeepSpeed is disabled on Apple Silicon.

Health Check

  • Last commit: 8 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 3
  • Issues (30d): 1
  • Star History: 459 stars in the last 90 days
