iszhanjiawei
Expressive zero-shot TTS with precise duration and emotion control
Top 81.9% on SourcePulse
IndexTTS2 is an autoregressive text-to-speech (TTS) system addressing precise duration control and emotionally expressive zero-shot synthesis. It targets applications requiring synchronized audio-visual output or nuanced emotional vocalizations, offering independent control over speaker timbre and emotion for enhanced naturalness and clarity.
How It Works
This system introduces a novel, general method for duration control within autoregressive TTS, enabling both explicit token count specification for precise timing and free generation. It achieves disentanglement of speaker identity and emotional expression, allowing zero-shot reproduction of emotions from prompts or independent emotional guidance via audio or natural language descriptions. GPT latent representations enhance stability during intense emotional expressions, while a Qwen3-tuned soft instruction mechanism facilitates emotion control through text.
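The two duration modes described above (an explicit token count versus free generation) can be illustrated with a small toy sketch. It is not the project's code: the `STOP` token, the rate of 25 tokens per second, the toy sampler, and every function name here are hypothetical, and the sketch enforces the token budget in the decoding loop as a simplification rather than through the model itself.

```python
import random
from typing import Callable, List, Optional

STOP = -1  # hypothetical end-of-speech token id (assumption, not from the project)


def tokens_for_duration(seconds: float, tokens_per_second: float) -> int:
    """Convert a target duration into a token budget, assuming a constant token rate."""
    if seconds <= 0 or tokens_per_second <= 0:
        raise ValueError("seconds and tokens_per_second must be positive")
    return max(1, round(seconds * tokens_per_second))


def generate(next_token: Callable[[List[int], bool], int],
             num_tokens: Optional[int] = None,
             max_tokens: int = 2000) -> List[int]:
    """Toy autoregressive decoding loop.

    num_tokens=None -> free generation: the model may end the utterance with STOP.
    num_tokens=k    -> fixed duration: STOP is disallowed and decoding ends at k tokens.
    """
    fixed = num_tokens is not None
    limit = num_tokens if fixed else max_tokens
    out: List[int] = []
    while len(out) < limit:
        tok = next_token(out, not fixed)  # second argument: is STOP allowed?
        if tok == STOP:
            break
        out.append(tok)
    return out


def make_toy_model(seed: int) -> Callable[[List[int], bool], int]:
    """Stand-in for a real model: random tokens, occasionally STOP when allowed."""
    rng = random.Random(seed)

    def next_token(history: List[int], allow_stop: bool) -> int:
        if allow_stop and len(history) >= 5 and rng.random() < 0.2:
            return STOP
        return rng.randrange(100)

    return next_token


budget = tokens_for_duration(3.0, 25.0)  # 25 tokens/s is an assumed rate
fixed_tokens = generate(make_toy_model(0), num_tokens=budget)
free_tokens = generate(make_toy_model(0))
print(len(fixed_tokens), len(free_tokens))
```

In fixed mode the output length equals the budget exactly, which is what makes synchronization with video possible; in free mode the length is whatever the sampler produces before it emits the stop token.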
Quick Start & Requirements
Install the dependencies (pip install -r requirements.txt). For CLI use, run pip install -e .; for the web UI, run pip install -e ".[webui]".
pynini can be installed through conda: conda install -c conda-forge pynini==2.1.5.
Model weights must be downloaded separately (e.g., via huggingface-cli or wget).
Highlighted Details
Maintenance & Community
1048202584) and a Discord server (https://discord.gg/uT32E7KDmy).
Contact: zhousiyi02@bilibili.com, zhouxun@bilibili.com, indexspeech@bilibili.com.
Licensing & Compatibility
pynini. GPU acceleration is likely beneficial but not strictly required for basic inference.
Limitations & Caveats
pynini library, requiring a conda-based workaround.