Zero-shot TTS system for industrial use
Top 11.8% on sourcepulse
IndexTTS is an industrial-level zero-shot text-to-speech system designed for high-quality, controllable voice synthesis, particularly excelling in Chinese language scenarios. It targets researchers and developers seeking advanced TTS capabilities, offering state-of-the-art performance and features like pronunciation correction and precise pause control.
How It Works
IndexTTS builds upon XTTS and Tortoise, integrating a conformer conditioning encoder and a BigVGAN2-based speechcode decoder. This architecture enhances training stability, speaker similarity, and audio quality. A key innovation is its character-pinyin hybrid modeling for accurate Chinese pronunciation, alongside punctuation-based pause control.
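The character-pinyin hybrid idea can be illustrated with a small sketch. This is not IndexTTS's tokenizer: the function name `build_hybrid_sequence`, the `overrides` mapping, the uppercase tone-numbered pinyin format (`ZHANG3`), and the punctuation set are all assumptions made for illustration. It only shows how a text could mix plain characters with pinyin tokens at positions where the pronunciation needs correcting, while keeping punctuation as separate tokens that a model could use as pause cues.

```python
# Hypothetical sketch of a character-pinyin hybrid input sequence.
# None of these names come from IndexTTS; they are illustrative only.

# Assumed set of punctuation marks that could act as pause cues.
PUNCTUATION = set(",.!?;:，。！？；：、")


def build_hybrid_sequence(text, overrides):
    """Return (kind, token) pairs for each character in `text`.

    `overrides` maps a character index to a pinyin string that should
    replace the character at that position (pronunciation correction).
    """
    sequence = []
    for index, char in enumerate(text):
        if index in overrides:
            # The caller pinned this position to an explicit pronunciation.
            sequence.append(("pinyin", overrides[index]))
        elif char in PUNCTUATION:
            # Punctuation stays in the sequence as a possible pause cue.
            sequence.append(("punct", char))
        else:
            sequence.append(("char", char))
    return sequence


if __name__ == "__main__":
    # Index 3 is the polyphonic character 长, pinned here to "ZHANG3".
    for kind, token in build_hybrid_sequence("银行行长，你好。", {3: "ZHANG3"}):
        print(kind, token)
```

Keying the overrides by position rather than by character lets the same character keep its default reading elsewhere in the sentence.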
Quick Start & Requirements
Install: pip install -r requirements.txt and pip install -e . for CLI usage.
Prerequisites: ffmpeg, PyTorch. Model weights must be downloaded to a checkpoints directory.
Web UI: pip install -e ".[webui]" && python webui.py
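Before running the CLI or web UI, it can help to confirm that the prerequisites listed above are in place. The following is a minimal preflight sketch using only the Python standard library. The function name `check_prerequisites` is hypothetical, and it assumes the weights live in a directory named checkpoints relative to the working directory; it does not verify which files that directory contains.

```python
# Hypothetical preflight check; not part of IndexTTS.
import importlib.util
import shutil
from pathlib import Path


def check_prerequisites(checkpoint_dir="checkpoints"):
    """Report whether each prerequisite appears to be available."""
    return {
        # ffmpeg must be discoverable on PATH.
        "ffmpeg": shutil.which("ffmpeg") is not None,
        # PyTorch must be importable (checked without importing it).
        "torch": importlib.util.find_spec("torch") is not None,
        # The model weights directory must exist.
        "checkpoints": Path(checkpoint_dir).is_dir(),
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

Using `find_spec` instead of a real import keeps the check fast, since importing PyTorch can take several seconds.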
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
Installing pynini may require conda.