TTS best practice based on BERT and VITS
This project provides a Chinese Text-to-Speech (TTS) system based on the VITS architecture, enhanced with BERT for natural prosody and with NaturalSpeech techniques for fewer sound errors. It targets researchers and developers who want to study TTS algorithms, and offers ONNX streaming output and module-wise distillation for faster inference.
How It Works
The system integrates BERT's hidden prosody embeddings to capture grammatical pauses, improving naturalness. It adopts NaturalSpeech's inference loss to reduce sound artifacts and builds on the VITS framework for high audio quality. Module-wise distillation is used to speed up inference for latency-sensitive applications.
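To make the prosody-fusion idea concrete, here is a minimal PyTorch sketch (not the repository's actual code) that projects per-character BERT hidden states onto the phoneme sequence and adds them to the phoneme embeddings before the acoustic model; the dimensions, module names, and character-to-phoneme mapping are assumptions.

```python
import torch
import torch.nn as nn

class ProsodyFusion(nn.Module):
    """Hypothetical sketch: add projected BERT hidden states to phoneme embeddings."""
    def __init__(self, n_phonemes=256, d_model=192, d_bert=768):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, d_model)
        self.bert_proj = nn.Linear(d_bert, d_model)  # project 768-d BERT states to model width

    def forward(self, phoneme_ids, bert_hidden, char_to_phoneme):
        # phoneme_ids: (T_ph,), bert_hidden: (T_char, 768)
        # char_to_phoneme: number of phonemes produced by each character
        expanded = torch.repeat_interleave(bert_hidden, char_to_phoneme, dim=0)  # (T_ph, 768)
        return self.phoneme_emb(phoneme_ids) + self.bert_proj(expanded)

# Toy usage: 3 characters expand to 2 + 2 + 1 = 5 phonemes.
fusion = ProsodyFusion()
phonemes = torch.randint(0, 256, (5,))
bert_states = torch.randn(3, 768)
counts = torch.tensor([2, 2, 1])
print(fusion(phonemes, bert_states, counts).shape)  # torch.Size([5, 192])
```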
Quick Start & Requirements
pip install -r requirements.txt
cd monotonic_align && python setup.py build_ext --inplace
python vits_infer.py --config ./configs/bert_vits.json --model vits_bert_model.pth
python vits_infer_stream.py --config ./configs/bert_vits.json --model vits_bert_model.pth
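Once a model has been exported to ONNX, it can be exercised with onnxruntime. The sketch below is an assumption-heavy illustration rather than the project's documented interface: the model filename, input/output names, and sample rate are placeholders to be replaced with the values used by the export script.

```python
# Hypothetical sketch of running an exported ONNX model with onnxruntime.
# Model path, input/output names, and sample rate are assumptions, not the project's real interface.
import numpy as np
import onnxruntime as ort
import soundfile as sf

sess = ort.InferenceSession("vits_bert_model.onnx", providers=["CPUExecutionProvider"])

phoneme_ids = np.array([[12, 54, 7, 91, 3]], dtype=np.int64)   # placeholder token ids
lengths = np.array([phoneme_ids.shape[1]], dtype=np.int64)

audio = sess.run(
    None,
    {"input": phoneme_ids, "input_lengths": lengths},          # assumed input names
)[0].squeeze()

sf.write("out.wav", audio.astype(np.float32), 16000)           # assumed sample rate
```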
Maintenance & Community
The project is maintained by PlayVoice, with links to Hugging Face Spaces for demos and model hosting.
Licensing & Compatibility
The project appears to be primarily MIT-licensed, but individual components or dependencies may carry different licenses. Suitability for commercial use should be verified against the licenses of all dependencies.
Limitations & Caveats
The project explicitly states that it is intended for learning TTS algorithms and may not be suitable for direct production use. ONNX export is supported, and the README notes that certain warnings emitted during export can be ignored. The natural pauses contributed by BERT may be less pronounced in the no-BERT inference mode or when text is segmented for synthesis on low-resource devices.
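For the low-resource segmentation case mentioned above, a naive approach is to split the input text at sentence-final punctuation and synthesize each chunk separately; the sketch below shows one hedged way to do that, where the regex and chunk-size policy are assumptions rather than the project's own logic.

```python
# Hypothetical sketch: split text at Chinese/Latin sentence-final punctuation so each
# chunk can be synthesized separately; the regex and max_chars policy are assumptions.
import re

def split_sentences(text, max_chars=64):
    pieces = [p.strip() for p in re.split(r"(?<=[。！？；!?;.])", text) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        if current and len(current) + len(piece) > max_chars:
            chunks.append(current)
            current = piece
        else:
            current += piece
    if current:
        chunks.append(current)
    return chunks

print(split_sentences("今天天气很好。我们去公园散步吧！你觉得怎么样？"))
# Returns one chunk when the text fits within max_chars, otherwise several.
```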