IndexTTS2 by iszhanjiawei

Expressive zero-shot TTS with precise duration and emotion control

Created 2 months ago
335 stars

Top 81.9% on SourcePulse

View on GitHub
Project Summary

IndexTTS2 is an autoregressive text-to-speech (TTS) system addressing precise duration control and emotionally expressive zero-shot synthesis. It targets applications requiring synchronized audio-visual output or nuanced emotional vocalizations, offering independent control over speaker timbre and emotion for enhanced naturalness and clarity.

How It Works

This system introduces a novel, general method for duration control in autoregressive TTS, supporting both explicit token-count specification for precise timing and free-running generation. It disentangles speaker identity from emotional expression, so emotion can be reproduced zero-shot from the speaker prompt or guided independently via a reference audio clip or a natural-language description. GPT latent representations stabilize synthesis during intense emotional expression, while a Qwen3-tuned soft-instruction mechanism enables emotion control through text.
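
As a minimal sketch of how this surfaces in code: the project exposes a Python entry point roughly along the lines below. The module path and parameter names (IndexTTS2, infer, spk_audio_prompt, emo_audio_prompt, emo_alpha) are assumptions to verify against the current README, and the audio paths are placeholders.

    # Minimal zero-shot inference sketch. Module path and parameter names
    # are assumed from the project README and may differ between releases;
    # the audio paths are placeholders.
    from indextts.infer_v2 import IndexTTS2

    # Point at the separately downloaded weights (see Quick Start below).
    tts = IndexTTS2(cfg_path="checkpoints/config.yaml", model_dir="checkpoints")

    # Plain zero-shot synthesis: timbre and emotion both come from the prompt.
    tts.infer(
        spk_audio_prompt="examples/voice_01.wav",
        text="The weather turned suddenly, and so did her mood.",
        output_path="gen_neutral.wav",
    )

    # Disentangled control: timbre from one clip, emotion from another,
    # with emo_alpha (assumed parameter) scaling the emotion strength.
    tts.infer(
        spk_audio_prompt="examples/voice_01.wav",
        emo_audio_prompt="examples/emo_sad.wav",
        emo_alpha=0.9,
        text="The weather turned suddenly, and so did her mood.",
        output_path="gen_sad.wav",
    )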

Quick Start & Requirements

  • Installation: Clone the repository, create and activate a Python 3.10 conda environment, and install dependencies (pip install -r requirements.txt). For CLI use, run pip install -e .; for the web UI, pip install -e ".[webui]". A consolidated shell sketch follows this list.
  • Prerequisites: Python 3.10 and PyTorch. Windows users may need conda install -c conda-forge pynini==2.1.5. Model weights must be downloaded separately (e.g., via huggingface-cli or wget).
  • Links: HuggingFace Demo, Paper, GitHub.
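
Taken together, the bullets above map to a shell session like the one below. The repository URL and model identifier are deliberately left as placeholders to fill in from the GitHub and HuggingFace links.

    # Clone and set up the environment (substitute the real repository URL).
    git clone <repository-url> indextts2
    cd indextts2
    conda create -n index-tts python=3.10 -y
    conda activate index-tts

    # Windows only: pynini installs most reliably from conda-forge.
    conda install -c conda-forge pynini==2.1.5

    # Dependencies, then the package itself.
    pip install -r requirements.txt
    pip install -e .              # CLI use
    pip install -e ".[webui]"     # web UI instead

    # Weights are not bundled; download them separately, e.g.:
    huggingface-cli download <model-id> --local-dir checkpoints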

Highlighted Details

  • First autoregressive TTS with precise duration control (controllable/free modes).
  • Highly expressive emotional speech synthesis with multi-modal emotion control.
  • Decoupled speaker timbre and emotional features for independent manipulation.
  • GPT latent representations improve stability during strong emotional expressions.
  • Soft instruction mechanism enables emotion guidance via textual descriptions (see the sketch after this list).
  • Claims superior performance over SOTA zero-shot TTS in WER, speaker similarity, and emotional fidelity.
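
The text-guided emotion path in the bullet above could look like the following, continuing the tts object from the earlier sketch; use_emo_text and emo_text are assumed parameter names to check against the README.

    # Emotion guided by a free-form description instead of reference audio.
    # use_emo_text / emo_text are assumed parameter names; verify them
    # against the README for your release.
    tts.infer(
        spk_audio_prompt="examples/voice_01.wav",
        text="Is it really you? I can hardly believe it!",
        use_emo_text=True,
        emo_text="surprised and overjoyed",
        output_path="gen_surprised.wav",
    )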

Maintenance & Community

  • Active development with releases in 2025.
  • Community channels include a QQ group (1048202584) and a Discord server (https://discord.gg/uT32E7KDmy).
  • Contact emails: zhousiyi02@bilibili.com, zhouxun@bilibili.com, indexspeech@bilibili.com.

Licensing & Compatibility

  • License: Not specified in the README. This is a critical omission for adoption decisions.
  • Compatibility: Primarily Python-based. Windows installation may require specific conda steps for pynini. GPU acceleration is likely beneficial but not strictly required for basic inference; a quick device check is sketched below.
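
A quick, project-agnostic way to confirm which device inference will use (plain PyTorch, nothing IndexTTS2-specific assumed):

    import torch

    # Prefer a GPU when present; basic inference still runs on CPU, just slower.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    print(f"Running inference on {device}")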

Limitations & Caveats

  • The absence of a specified license poses a significant adoption blocker, particularly for commercial use or integration into proprietary systems.
  • Windows users face a documented installation hurdle for the pynini library, requiring a conda-based workaround.
  • The project is relatively new, with its core paper submitted in early 2025.
Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 4

Star History

  • 138 stars in the last 30 days

Explore Similar Projects

Starred by Omar Sanseviero (DevRel at Google DeepMind), Li Jiang (Coauthor of AutoGen; Engineer at Microsoft), and 3 more.

ChatTTS by 2noise

Generative speech model for daily dialogue

Top 0.1% on SourcePulse · 38k stars
Created 1 year ago · Updated 4 months ago