vits_chinese by PlayVoice

TTS best practice based on BERT and VITS

Created 4 years ago
1,211 stars

Top 32.3% on SourcePulse

Project Summary

This project provides a Chinese Text-to-Speech (TTS) system built on the VITS architecture. It adds BERT for more natural prosody and borrows from NaturalSpeech to reduce sound errors. It targets researchers and developers who want to study TTS algorithms, and it offers ONNX streaming output and module-wise distillation for faster inference.

How It Works

The system integrates BERT's hidden prosody embeddings to capture grammatical pauses, improving naturalness. It leverages NaturalSpeech's inference loss to minimize sound artifacts and uses the VITS framework for high audio quality. Module-wise distillation is employed to achieve speedups, making it suitable for applications requiring faster inference.
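
The summary does not say how BERT's character-level prosody features are aligned with the phoneme sequence that VITS consumes. The sketch below shows one plausible scheme, offered as an assumption rather than the project's actual code: repeat each character's vector once per phoneme of that character, then add it to the phoneme embedding. The names `expand_to_phonemes` and `add_prosody` are hypothetical, and plain Python lists stand in for tensors.

```python
from typing import List, Sequence

Vector = List[float]


def expand_to_phonemes(
    char_vectors: Sequence[Vector], phonemes_per_char: Sequence[int]
) -> List[Vector]:
    """Repeat each character-level vector once per phoneme of that character."""
    if len(char_vectors) != len(phonemes_per_char):
        raise ValueError("one phoneme count is required per character")
    expanded: List[Vector] = []
    for vector, count in zip(char_vectors, phonemes_per_char):
        # Copy each repeat so the expanded frames do not share one list object.
        expanded.extend(list(vector) for _ in range(count))
    return expanded


def add_prosody(
    phoneme_vectors: Sequence[Vector], prosody_vectors: Sequence[Vector]
) -> List[Vector]:
    """Element-wise sum of phoneme embeddings and aligned prosody vectors."""
    if len(phoneme_vectors) != len(prosody_vectors):
        raise ValueError("sequences must have the same number of frames")
    return [
        [p + q for p, q in zip(phoneme, prosody)]
        for phoneme, prosody in zip(phoneme_vectors, prosody_vectors)
    ]
```

In a real model the prosody vectors would normally pass through a learned projection before the sum so that their dimension matches the phoneme embedding; that step is omitted here.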

Quick Start & Requirements

  • Install dependencies: pip install -r requirements.txt, then build the monotonic alignment extension: cd monotonic_align && python setup.py build_ext --inplace.
  • Download the pretrained model and the prosody model from the releases page and place them in the directories the README specifies.
  • Inference command: python vits_infer.py --config ./configs/bert_vits.json --model vits_bert_model.pth.
  • Streaming inference: python vits_infer_stream.py --config ./configs/bert_vits.json --model vits_bert_model.pth.
  • ONNX export and inference commands are also provided.
  • Training requires the Baker dataset (16kHz) or AISHELL3 dataset.
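
The two inference commands above differ only in the script name, so a small wrapper can build either one. This is a minimal sketch: the helper `build_infer_command` is hypothetical, and only the script names and the `--config` and `--model` flags come from the commands listed above.

```python
import shlex
from typing import List


def build_infer_command(
    config: str, model: str, streaming: bool = False, python: str = "python"
) -> List[str]:
    """Return the argv list for regular or streaming inference."""
    script = "vits_infer_stream.py" if streaming else "vits_infer.py"
    return [python, script, "--config", config, "--model", model]


# Print the command instead of running it, so the sketch has no side effects.
command = build_infer_command("./configs/bert_vits.json", "vits_bert_model.pth")
print(shlex.join(command))
```

Passing the resulting list to `subprocess.run(command, check=True)` from the repository root would launch inference, assuming the dependencies and model files are already in place.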

Highlighted Details

  • Supports ONNX streaming output for efficient inference.
  • Achieves 3x speedup on a student model via knowledge distillation.
  • Offers a no-BERT inference option for lower computational resources.
  • Includes support for multi-speaker training using AISHELL3 data.
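
The summary does not describe how streaming output is implemented. A generic way to stream from a frame-based decoder is to decode fixed-size chunks with a few frames of context on each side and keep only the central part of each result. The sketch below illustrates that idea under those assumptions; `stream_windows`, `stream_decode`, and the `decode` callable are hypothetical, and `decode` is assumed to return one output item per input frame.

```python
from typing import Callable, Iterator, List, Sequence, Tuple


def stream_windows(
    num_frames: int, chunk_size: int, context: int = 0
) -> Iterator[Tuple[int, int, int, int]]:
    """Yield (win_start, win_end, keep_start, keep_end) for each chunk.

    The window includes up to `context` extra frames per side; the keep
    range is relative to the window and marks the frames to emit.
    """
    if chunk_size <= 0 or context < 0:
        raise ValueError("chunk_size must be positive and context non-negative")
    for start in range(0, num_frames, chunk_size):
        end = min(start + chunk_size, num_frames)
        win_start = max(0, start - context)
        win_end = min(num_frames, end + context)
        yield win_start, win_end, start - win_start, end - win_start


def stream_decode(
    frames: Sequence,
    decode: Callable[[Sequence], Sequence],
    chunk_size: int,
    context: int = 0,
) -> Iterator[List]:
    """Decode chunk by chunk and yield only the non-context part of each result."""
    for win_start, win_end, keep_start, keep_end in stream_windows(
        len(frames), chunk_size, context
    ):
        decoded = decode(frames[win_start:win_end])
        yield list(decoded[keep_start:keep_end])
```

With a real vocoder each frame expands to many audio samples, so the keep range would need to be multiplied by the hop size before trimming.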

Maintenance & Community

The project is maintained by PlayVoice. It links to Hugging Face Spaces for demos and model hosting.

Licensing & Compatibility

The project appears to be primarily licensed under MIT, but specific components or dependencies might have different licenses. Compatibility for commercial use should be verified with the specific licenses of all dependencies.

Limitations & Caveats

The project states explicitly that it is intended for learning TTS algorithms, so it may not be suitable for direct production use. ONNX export is supported, and the README notes that warnings raised during export can be ignored. The natural pauses that BERT provides may be less pronounced in no-BERT inference mode or when speech is synthesized in segments for low-resource devices.
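
Segmenting for low-resource devices usually means synthesizing one short piece of text at a time. A minimal sketch follows, assuming that segments are cut only after sentence-ending punctuation so that commas, and the pauses they carry, stay inside a segment. The function `split_segments` and the punctuation set are illustrative choices, not taken from the project.

```python
import re
from typing import List

# Zero-width split point right after sentence-ending punctuation
# (full-width Chinese marks plus ASCII "!", "?" and ";").
_SEGMENT_BOUNDARY = re.compile(r"(?<=[。！？；!?;])")


def split_segments(text: str) -> List[str]:
    """Split text after sentence-ending punctuation, dropping empty pieces."""
    pieces = _SEGMENT_BOUNDARY.split(text)
    return [piece.strip() for piece in pieces if piece.strip()]
```

Each returned segment can then be synthesized on its own; how much prosody is lost at the cut points depends on the model.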

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 4 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Pietro Schirano (Founder of MagicPath), and 2 more.

metavoice-src by metavoiceio

0.1%
4k
TTS model for human-like, expressive speech
Created 1 year ago
Updated 1 year ago
Starred by Georgios Konstantopoulos (CTO, General Partner at Paradigm) and Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems").

GPT-SoVITS by RVC-Boss

0.3%
51k
Few-shot voice cloning and TTS web UI
Created 1 year ago
Updated 1 week ago