index-tts  by index-tts

Zero-shot TTS system for industrial use

created 5 months ago
4,221 stars

Top 11.8% on sourcepulse

Project Summary

IndexTTS is an industrial-level zero-shot text-to-speech system designed for high-quality, controllable voice synthesis, with particular strength in Chinese. It targets researchers and developers who need advanced TTS capabilities, and offers state-of-the-art performance along with features such as pronunciation correction and precise pause control.

How It Works

IndexTTS builds upon XTTS and Tortoise, integrating a conformer conditioning encoder and a BigVGAN2-based speechcode decoder. This architecture enhances training stability, speaker similarity, and audio quality. A key innovation is its character-pinyin hybrid modeling for accurate Chinese pronunciation, alongside punctuation-based pause control.
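To make the character-pinyin hybrid idea concrete, here is a minimal sketch of how a mixed input sequence could be assembled. It is an illustration only: the function name to_hybrid_sequence, the overrides table, and the tone-numbered spelling "HANG2" are assumptions made for this example, not the IndexTTS tokenizer or its actual input format.

```python
# Illustrative only: the function name, the override table, and the
# tone-numbered pinyin spelling are hypothetical, not taken from IndexTTS.

def to_hybrid_sequence(text: str, overrides: dict[int, str]) -> list[str]:
    """Return one token per character, swapping in pinyin where requested.

    `overrides` maps a character index to the pinyin syllable that should
    replace the character at that position.
    """
    tokens = []
    for i, ch in enumerate(text):
        # Keep the character unless the caller pinned its pronunciation.
        tokens.append(overrides.get(i, ch))
    return tokens


# The character at index 1 has more than one reading, so the caller pins it.
hybrid = to_hybrid_sequence("银行", {1: "HANG2"})
print(hybrid)  # ['银', 'HANG2']
```

The point of the mixed sequence is that unambiguous characters stay as characters, while an explicit pinyin token is supplied only where the default reading would be wrong.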

Quick Start & Requirements

  • Install: run pip install -r requirements.txt; for CLI usage, also run pip install -e .
  • Prerequisites: Python 3.10, ffmpeg, PyTorch.
  • Setup: Download the model checkpoints (approx. 1.5GB) into a checkpoints directory.
  • Demos: HuggingFace (link), ModelScope (link).
  • Web UI: pip install -e ".[webui]" && python webui.py
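The prerequisites above can be verified before a first run with a small pre-flight script. This is a sketch using only the Python standard library; the function name check_environment and the default directory name "checkpoints" are assumptions for illustration, and the script does not validate which checkpoint files are present.

```python
import importlib.util
import shutil
import sys
from pathlib import Path


def check_environment(checkpoint_dir: str = "checkpoints") -> list[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []

    # The summary lists Python 3.10; other versions may or may not work.
    if sys.version_info[:2] != (3, 10):
        found = f"{sys.version_info.major}.{sys.version_info.minor}"
        problems.append(f"Python {found} found; 3.10 is the listed version")

    # ffmpeg must be discoverable on PATH.
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH")

    # Check that PyTorch is installed without importing it.
    if importlib.util.find_spec("torch") is None:
        problems.append("PyTorch (torch) is not installed")

    # The checkpoint directory should exist and contain at least one entry.
    path = Path(checkpoint_dir)
    if not path.is_dir() or not any(path.iterdir()):
        problems.append(f"checkpoint directory '{checkpoint_dir}' is missing or empty")

    return problems


for problem in check_environment():
    print("WARN:", problem)
```

Returning a list instead of raising on the first failure lets a new user see every missing prerequisite in one pass.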

Highlighted Details

  • Achieves state-of-the-art performance, outperforming popular TTS systems like XTTS, CosyVoice2, and F5-TTS in benchmarks.
  • Offers zero-shot voice cloning capabilities.
  • Features a character-pinyin hybrid approach for improved Chinese pronunciation.
  • Enables fine-grained control over pauses using punctuation.
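As a rough illustration of punctuation-based pause control, the sketch below splits input text into segments and records the punctuation mark that closes each one, which is where a synthesizer could place a pause. The set of marks in PAUSE_MARKS and the function split_at_pauses are assumptions for this example; how IndexTTS maps punctuation to pause length is not described here.

```python
import re

# Hypothetical set of pause-bearing marks (ASCII and full-width forms).
PAUSE_MARKS = ",.;:!?,。;:!?、"


def split_at_pauses(text: str) -> list[tuple[str, str]]:
    """Split text into (segment, closing_mark) pairs at pause punctuation."""
    pattern = f"([{re.escape(PAUSE_MARKS)}])"
    # With a capturing group, re.split alternates: text, mark, text, mark, ...
    parts = re.split(pattern, text)
    pairs = []
    for i in range(0, len(parts), 2):
        segment = parts[i].strip()
        mark = parts[i + 1] if i + 1 < len(parts) else ""
        if segment or mark:
            pairs.append((segment, mark))
    return pairs


print(split_at_pauses("Hello, world. Done"))
# [('Hello', ','), ('world', '.'), ('Done', '')]
```

Under this view, changing the punctuation in the input, for example replacing a comma with a full stop, is what changes where and how the output pauses.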

Maintenance & Community

  • Released model parameters and inference code on March 25, 2025.
  • Paper submitted to arXiv on February 12, 2025.
  • Community channels include QQ group (553460296) and Discord (link).

Licensing & Compatibility

  • The repository does not explicitly state a license. Model weights are available via HuggingFace and ModelScope.

Limitations & Caveats

  • Windows users may face issues installing pynini, requiring a conda installation.
  • More detailed information is available only by contacting the authors, which suggests that public documentation or support may be limited.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull requests (30d): 1
  • Issues (30d): 25
  • Star history: 2,797 stars in the last 90 days
