TTS best practice based on BERT and VITS
This project provides a Chinese Text-to-Speech (TTS) system based on the VITS architecture, enhanced with BERT for natural prosody and with NaturalSpeech techniques for fewer sound errors. It targets researchers and developers who want to study TTS algorithms, and offers ONNX streaming output and module-wise distillation for faster inference.
How It Works
The system integrates BERT's hidden prosody embeddings to capture grammatical pauses, improving naturalness. It adopts NaturalSpeech's inference loss to reduce sound artifacts and builds on the VITS framework for high audio quality. Module-wise distillation is used to speed up inference for latency-sensitive applications.
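To make the prosody-fusion idea concrete, here is a minimal PyTorch sketch (not the repository's actual code) that projects per-character BERT hidden states onto the phoneme sequence and adds them to the phoneme embeddings before the acoustic model; the dimensions, module names, and character-to-phoneme mapping are assumptions.

```python
import torch
import torch.nn as nn

class ProsodyFusion(nn.Module):
    """Hypothetical sketch: add projected BERT hidden states to phoneme embeddings."""
    def __init__(self, n_phonemes=256, d_model=192, d_bert=768):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, d_model)
        self.bert_proj = nn.Linear(d_bert, d_model)  # project 768-d BERT states to model width

    def forward(self, phoneme_ids, bert_hidden, char_to_phoneme):
        # phoneme_ids: (T_ph,), bert_hidden: (T_char, 768)
        # char_to_phoneme: number of phonemes produced by each character
        expanded = torch.repeat_interleave(bert_hidden, char_to_phoneme, dim=0)  # (T_ph, 768)
        return self.phoneme_emb(phoneme_ids) + self.bert_proj(expanded)

# Toy usage: 3 characters expand to 2 + 2 + 1 = 5 phonemes.
fusion = ProsodyFusion()
phonemes = torch.randint(0, 256, (5,))
bert_states = torch.randn(3, 768)
counts = torch.tensor([2, 2, 1])
print(fusion(phonemes, bert_states, counts).shape)  # torch.Size([5, 192])
```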
Quick Start & Requirements
pip install -r requirements.txt
cd monotonic_align && python setup.py build_ext --inplace
python vits_infer.py --config ./configs/bert_vits.json --model vits_bert_model.pth
python vits_infer_stream.py --config ./configs/bert_vits.json --model vits_bert_model.pth
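Once a model has been exported to ONNX, it can be exercised with onnxruntime. The sketch below is an assumption-heavy illustration rather than the project's documented interface: the model filename, input/output names, and sample rate are placeholders to be replaced with the values used by the export script.

```python
# Hypothetical sketch of running an exported ONNX model with onnxruntime.
# Model path, input/output names, and sample rate are assumptions, not the project's real interface.
import numpy as np
import onnxruntime as ort
import soundfile as sf

sess = ort.InferenceSession("vits_bert_model.onnx", providers=["CPUExecutionProvider"])

phoneme_ids = np.array([[12, 54, 7, 91, 3]], dtype=np.int64)   # placeholder token ids
lengths = np.array([phoneme_ids.shape[1]], dtype=np.int64)

audio = sess.run(
    None,
    {"input": phoneme_ids, "input_lengths": lengths},          # assumed input names
)[0].squeeze()

sf.write("out.wav", audio.astype(np.float32), 16000)           # assumed sample rate
```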
Maintenance & Community
The project is maintained by PlayVoice, with links to Hugging Face Spaces for demos and model hosting.
Licensing & Compatibility
The project appears to be primarily MIT-licensed, but individual components or dependencies may carry different licenses. Suitability for commercial use should be verified against the licenses of all dependencies.
Limitations & Caveats
The project explicitly states that it is intended for learning TTS algorithms and may not be suitable for direct production use. ONNX export is supported, and the README notes that certain warnings emitted during export can be ignored. The natural pauses contributed by BERT may be less pronounced in the no-BERT inference mode or when text is segmented for synthesis on low-resource devices.
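For the low-resource segmentation case mentioned above, a naive approach is to split the input text at sentence-final punctuation and synthesize each chunk separately; the sketch below shows one hedged way to do that, where the regex and chunk-size policy are assumptions rather than the project's own logic.

```python
# Hypothetical sketch: split text at Chinese/Latin sentence-final punctuation so each
# chunk can be synthesized separately; the regex and max_chars policy are assumptions.
import re

def split_sentences(text, max_chars=64):
    pieces = [p.strip() for p in re.split(r"(?<=[。！？；!?;.])", text) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        if current and len(current) + len(piece) > max_chars:
            chunks.append(current)
            current = piece
        else:
            current += piece
    if current:
        chunks.append(current)
    return chunks

print(split_sentences("今天天气很好。我们去公园散步吧！你觉得怎么样？"))
# Returns one chunk when the text fits within max_chars, otherwise several.
```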