Fast zero-shot TTS with flow matching
ZipVoice offers fast, high-quality zero-shot text-to-speech (TTS) and spoken dialogue generation using flow matching. It targets researchers and developers needing efficient, natural-sounding voice cloning and multi-speaker conversations, supporting both Chinese and English.
How It Works
ZipVoice uses flow matching, a generative modeling technique, to achieve state-of-the-art TTS performance. Because synthesis is non-autoregressive, inference is faster than with traditional autoregressive models. The models are also small: the base version has only 123M parameters, which makes it suitable for deployment in resource-constrained environments.
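To make the idea of non-autoregressive flow-matching synthesis more concrete, here is a minimal toy sketch in plain Python. It is not ZipVoice's model or API: `toy_velocity`, `sample`, and the `target` vector are hypothetical names, and the velocity function is a hand-written stand-in for what would be a trained neural network. The sketch only illustrates the sampling loop, which starts from noise and integrates a velocity field over a small, fixed number of steps, updating every output element in parallel at each step.

```python
import random


def toy_velocity(x, t, target):
    """Hypothetical stand-in for a trained network.

    For the straight-line path x_t = (1 - t) * x_0 + t * x_1, the velocity
    that carries x_t to the target x_1 is (x_1 - x_t) / (1 - t).
    """
    return [(x1 - xi) / (1.0 - t) for xi, x1 in zip(x, target)]


def sample(target, num_steps=8, seed=0):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 with Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so (1 - t) is never zero
        v = toy_velocity(x, t, target)
        # All elements (think: all output frames) are updated in the same step.
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


target = [0.5, -1.0, 2.0, 0.0]  # assumed toy "features", not real audio
result = sample(target, num_steps=8)
print(result)
```

The cost of this loop depends on the number of integration steps, not on the length of the output. An autoregressive model, by contrast, needs one forward pass per generated token or frame.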
Quick Start & Requirements
Install dependencies with pip install -r requirements.txt after cloning the repository.
Install k2 for training and faster inference, ensuring compatibility with your PyTorch and CUDA versions (e.g., pip install k2==1.24.4.dev20250208+cuda12.1.torch2.5.1 -f https://k2-fsa.github.io/k2/cuda.html).
Highlighted Details
A distilled variant (ZipVoice-Distill) for improved speed with minimal performance loss.
Dialogue models (ZipVoice-Dialog, ZipVoice-Dialog-Stereo).
Maintenance & Community
The project is actively developed by k2-fsa, with recent releases of dialogue models and the OpenDialog dataset. Discussions can be held via GitHub Issues.
Licensing & Compatibility
The repository does not explicitly state a license in the provided README. Users should verify licensing for commercial or closed-source use.
Limitations & Caveats
The README does not specify any explicit limitations or known bugs. Installation of k2 requires careful attention to PyTorch and CUDA version compatibility.
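The k2 version string in the Quick Start example appears to encode three things: the k2 release, the CUDA version, and the PyTorch version. The sketch below rebuilds that string from its parts so the three values can be matched to the local environment. The helper names `k2_requirement` and `pip_command` are hypothetical. The naming pattern is inferred from that single example, so a wheel for any other combination is not guaranteed to exist and should be checked against the index page. If PyTorch is already installed, `torch.__version__` and `torch.version.cuda` report the local values.

```python
def k2_requirement(k2_version, cuda_version, torch_version):
    """Build a pip requirement string.

    Assumes the pattern seen in the Quick Start example:
    k2==<k2 version>+cuda<CUDA version>.torch<PyTorch version>
    """
    return f"k2=={k2_version}+cuda{cuda_version}.torch{torch_version}"


def pip_command(requirement, index_url="https://k2-fsa.github.io/k2/cuda.html"):
    """Format the install command; the index URL is the one from the example."""
    return f"pip install {requirement} -f {index_url}"


# Values copied from the Quick Start example; replace them with local versions.
spec = k2_requirement("1.24.4.dev20250208", "12.1", "2.5.1")
command = pip_command(spec)
print(command)
```

Keeping the three version numbers as separate arguments makes a mismatch easier to spot than editing the full string by hand.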