tortoise-tts by neonbjb

Multi-voice TTS system emphasizing quality, realistic prosody

Created 4 years ago

14,814 stars

Top 3.3% on SourcePulse


12 Experts Love This Project

Benjamin Bolte

Cofounder of K-Scale Labs

Travis Fischer

Founder of Agentic

Vincent Weisser

Cofounder of Prime Intellect

Shawn Wang

Editor of Latent Space

and 8 more!

Project Summary

Tortoise TTS is a multi-voice text-to-speech system built for realistic prosody and intonation. It targets researchers and developers who need advanced TTS capabilities and are willing to trade generation speed for more natural-sounding output.

How It Works

Tortoise TTS uses a dual-decoder architecture that combines an autoregressive decoder with a diffusion decoder. This combination produces detailed, natural-sounding speech that captures nuances of intonation and prosody. The model is trained for output quality first, with speed as a secondary concern.

Quick Start & Requirements

Install: pip install tortoise-tts (released package) or pip install git+https://github.com/neonbjb/tortoise-tts (latest from GitHub)
Prerequisites: NVIDIA GPU (CUDA 11.7+ recommended), Python 3.9+. Conda installation is highly recommended for Windows to manage dependencies. Apple Silicon requires PyTorch nightly builds.
Setup: Local installation involves cloning the repo, setting up a Conda environment, installing PyTorch, and running python setup.py install. Docker is also provided.
Docs: Manuscript, Hugging Face Space
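
Once installed, programmatic use might look like the sketch below. It is not taken from this page: the calls (TextToSpeech, tts_with_preset, load_voice) and the 24 kHz output rate follow the upstream README and scripts as best understood, so check them against the installed version. The synthesize wrapper and the PRESETS tuple are hypothetical helpers added for illustration; this page names only the fast and ultra_fast presets.

```python
from typing import Optional

# Preset names are an assumption based on the upstream README;
# this summary page itself only mentions "fast" and "ultra_fast".
PRESETS = ("ultra_fast", "fast", "standard", "high_quality")


def synthesize(text: str, voice: str = "random", preset: str = "fast",
               out_path: Optional[str] = None):
    """Hypothetical wrapper: generate speech for `text` with a Tortoise preset."""
    if preset not in PRESETS:
        raise ValueError(f"unknown preset {preset!r}; expected one of {PRESETS}")
    if not text.strip():
        raise ValueError("text must not be empty")

    # Imported lazily so this module loads even where tortoise-tts
    # (and its PyTorch dependencies) are not installed.
    import torchaudio
    from tortoise.api import TextToSpeech
    from tortoise.utils.audio import load_voice

    tts = TextToSpeech()  # fetches model weights on first use
    voice_samples, conditioning_latents = load_voice(voice)
    audio = tts.tts_with_preset(
        text,
        voice_samples=voice_samples,
        conditioning_latents=conditioning_latents,
        preset=preset,
    )
    if out_path is not None:
        # Assumption: output audio is 24 kHz, as in the upstream example scripts.
        torchaudio.save(out_path, audio.squeeze(0).cpu(), 24000)
    return audio
```

A call such as synthesize("Hello there.", preset="ultra_fast", out_path="hello.wav") would then write a WAV file. The preset is validated before any model is loaded, so a typo fails fast instead of after the weights have been loaded.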

Highlighted Details

Reports a real-time factor (RTF) of 0.25-0.3 on 4 GB of VRAM; with streaming, latency drops below 500 ms.
Supports multiple voices and programmatic API usage.
Offers presets for faster inference (fast, ultra_fast).
Includes tools for batch processing text files (read.py, read_fast.py).
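
To make the RTF figure quoted above concrete, here is a small worked example. It assumes the conventional definition (synthesis time divided by the duration of the audio produced, so values below 1 are faster than real time); the page does not define the term, and some projects report the inverse. The function names and the 10-second clip are hypothetical.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def estimated_synthesis_seconds(rtf: float, audio_seconds: float) -> float:
    """Invert the definition: how long a clip of a given length takes to generate."""
    return rtf * audio_seconds


# Hypothetical example: a 10-second clip at both ends of the quoted RTF range.
clip_seconds = 10.0
for rtf in (0.25, 0.3):
    seconds = estimated_synthesis_seconds(rtf, clip_seconds)
    print(f"RTF {rtf}: about {seconds:.1f} s to synthesize {clip_seconds:.0f} s of audio")
```

Under this reading, a 10-second clip would take roughly 2.5 to 3 seconds to generate, that is, three to four times faster than real time.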

Maintenance & Community

Developed by James Betker; the author's employer was not involved in the project.
Project appears inactive: the last commit was about a year ago, although the repository continues to gain stars.

Licensing & Compatibility

Licensed under Apache 2.0.
Permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

The author's original description calls the model "insanely slow"; later updates report significant speed improvements. The Hugging Face demo does not support CPU-only inference. DeepSpeed is disabled on Apple Silicon.

Health Check

Last Commit: 1 year ago

Responsiveness: Inactive

Star History

41 stars in the last 30 days