IndexTTS2 by iszhanjiawei

Expressive zero-shot TTS with precise duration and emotion control

Created 2 months ago
335 stars

Top 81.9% on SourcePulse

View on GitHub
Project Summary

IndexTTS2 is an autoregressive text-to-speech (TTS) system addressing precise duration control and emotionally expressive zero-shot synthesis. It targets applications requiring synchronized audio-visual output or nuanced emotional vocalizations, offering independent control over speaker timbre and emotion for enhanced naturalness and clarity.

How It Works

This system introduces a novel, general method for duration control in autoregressive TTS, supporting both explicit token-count specification for precise timing and free-running generation. It disentangles speaker identity from emotional expression, so emotion can be reproduced zero-shot from the speaker prompt or guided independently via a reference audio clip or a natural-language description. GPT latent representations stabilize synthesis during intense emotional expression, while a Qwen3-tuned soft-instruction mechanism enables emotion control through text.
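
As a minimal sketch of how this surfaces in code: the project exposes a Python entry point roughly along the lines below. The module path and parameter names (IndexTTS2, infer, spk_audio_prompt, emo_audio_prompt, emo_alpha) are assumptions to verify against the current README, and the audio paths are placeholders.

    # Minimal zero-shot inference sketch. Module path and parameter names
    # are assumed from the project README and may differ between releases;
    # the audio paths are placeholders.
    from indextts.infer_v2 import IndexTTS2

    # Point at the separately downloaded weights (see Quick Start below).
    tts = IndexTTS2(cfg_path="checkpoints/config.yaml", model_dir="checkpoints")

    # Plain zero-shot synthesis: timbre and emotion both come from the prompt.
    tts.infer(
        spk_audio_prompt="examples/voice_01.wav",
        text="The weather turned suddenly, and so did her mood.",
        output_path="gen_neutral.wav",
    )

    # Disentangled control: timbre from one clip, emotion from another,
    # with emo_alpha (assumed parameter) scaling the emotion strength.
    tts.infer(
        spk_audio_prompt="examples/voice_01.wav",
        emo_audio_prompt="examples/emo_sad.wav",
        emo_alpha=0.9,
        text="The weather turned suddenly, and so did her mood.",
        output_path="gen_sad.wav",
    )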

Quick Start & Requirements

  • Installation: Clone the repository, create and activate a Python 3.10 conda environment, and install dependencies (pip install -r requirements.txt). For CLI use, run pip install -e .; for the web UI, pip install -e ".[webui]". A consolidated shell sketch follows this list.
  • Prerequisites: Python 3.10 and PyTorch. Windows users may need conda install -c conda-forge pynini==2.1.5. Model weights must be downloaded separately (e.g., via huggingface-cli or wget).
  • Links: HuggingFace Demo, Paper, GitHub.
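
Taken together, the bullets above map to a shell session like the one below. The repository URL and model identifier are deliberately left as placeholders to fill in from the GitHub and HuggingFace links.

    # Clone and set up the environment (substitute the real repository URL).
    git clone <repository-url> indextts2
    cd indextts2
    conda create -n index-tts python=3.10 -y
    conda activate index-tts

    # Windows only: pynini installs most reliably from conda-forge.
    conda install -c conda-forge pynini==2.1.5

    # Dependencies, then the package itself.
    pip install -r requirements.txt
    pip install -e .              # CLI use
    pip install -e ".[webui]"     # web UI instead

    # Weights are not bundled; download them separately, e.g.:
    huggingface-cli download <model-id> --local-dir checkpoints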

Highlighted Details

  • First autoregressive TTS with precise duration control (controllable/free modes).
  • Highly expressive emotional speech synthesis with multi-modal emotion control.
  • Decoupled speaker timbre and emotional features for independent manipulation.
  • GPT latent representations improve stability during strong emotional expressions.
  • Soft instruction mechanism enables emotion guidance via textual descriptions (see the sketch after this list).
  • Claims superior performance over SOTA zero-shot TTS in WER, speaker similarity, and emotional fidelity.
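
The text-guided emotion path in the bullet above could look like the following, continuing the tts object from the earlier sketch; use_emo_text and emo_text are assumed parameter names to check against the README.

    # Emotion guided by a free-form description instead of reference audio.
    # use_emo_text / emo_text are assumed parameter names; verify them
    # against the README for your release.
    tts.infer(
        spk_audio_prompt="examples/voice_01.wav",
        text="Is it really you? I can hardly believe it!",
        use_emo_text=True,
        emo_text="surprised and overjoyed",
        output_path="gen_surprised.wav",
    )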

Maintenance & Community

  • Active development with releases in 2025.
  • Community channels include a QQ group (1048202584) and a Discord server (https://discord.gg/uT32E7KDmy).
  • Contact emails: zhousiyi02@bilibili.com, zhouxun@bilibili.com, indexspeech@bilibili.com.

Licensing & Compatibility

  • License: Not specified in the README. This is a critical omission for adoption decisions.
  • Compatibility: Primarily Python-based. Windows installation may require specific conda steps for pynini. GPU acceleration is likely beneficial but not strictly required for basic inference; a quick device check is sketched below.
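
A quick, project-agnostic way to confirm which device inference will use (plain PyTorch, nothing IndexTTS2-specific assumed):

    import torch

    # Prefer a GPU when present; basic inference still runs on CPU, just slower.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    print(f"Running inference on {device}")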

Limitations & Caveats

  • The absence of a specified license poses a significant adoption blocker, particularly for commercial use or integration into proprietary systems.
  • Windows users face a documented installation hurdle for the pynini library, requiring a conda-based workaround.
  • The project is relatively new, with its core paper submitted in early 2025.
Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 4

Star History

  • 138 stars in the last 30 days

Explore Similar Projects

Starred by Omar Sanseviero (DevRel at Google DeepMind), Li Jiang (Coauthor of AutoGen; Engineer at Microsoft), and 3 more.

ChatTTS by 2noise

Generative speech model for daily dialogue

Top 0.1% on SourcePulse · 38k stars
Created 1 year ago · Updated 4 months ago