Multilingual_Text_to_Speech  by Tomiinek

Tacotron 2 implementation for multilingual speech synthesis research

created 6 years ago
838 stars

Top 43.4% on sourcepulse

Project Summary

This repository provides an implementation of Tacotron 2 for multilingual text-to-speech (TTS) synthesis, supporting parameter sharing, code-switching, and voice cloning. It is targeted at researchers and developers working on advanced TTS systems who need to train models on multiple languages or handle mixed-language speech. The project explores several approaches to encoder parameter sharing, aiming to balance parameter efficiency with per-language flexibility.

How It Works

The core of the implementation is a Tacotron 2 architecture adapted for multilingual synthesis. It explores three encoder parameter-sharing strategies: full sharing with an adversarial classifier that removes speaker information from the encoder output; fully language-specific encoders; and a hybrid approach in which a parameter generator produces language-specific encoder parameters. The hybrid method, combined with domain adversarial training, aims for effective parameter sharing while retaining flexibility across languages and voices.
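The parameter-generator idea can be sketched as a small hypernetwork that maps a language embedding to the weights of one encoder layer. This is an illustrative PyTorch sketch, not the repository's implementation; the class name, shapes, and single-linear-layer target are assumptions for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearParameterGenerator(nn.Module):
    """Hypothetical sketch of the hybrid strategy: a generator network
    maps a language embedding to the weight matrix and bias of one
    language-specific linear layer of the encoder."""

    def __init__(self, lang_dim: int, in_features: int, out_features: int):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        # Emits a flattened weight matrix plus a bias vector.
        self.generator = nn.Linear(lang_dim,
                                   in_features * out_features + out_features)

    def forward(self, lang_embedding: torch.Tensor,
                x: torch.Tensor) -> torch.Tensor:
        params = self.generator(lang_embedding)       # (in*out + out,)
        split = self.in_features * self.out_features
        weight = params[:split].view(self.out_features, self.in_features)
        bias = params[split:]
        # Apply the generated, language-specific layer to the input.
        return F.linear(x, weight, bias)

# Usage with made-up dimensions: a 4-dim language embedding generates
# a 16 -> 32 linear layer, applied to a batch of 10 frames.
gen = LinearParameterGenerator(lang_dim=4, in_features=16, out_features=32)
y = gen(torch.randn(4), torch.randn(10, 16))
```

Because the encoder weights are a function of the language embedding, languages share the generator's parameters while still receiving distinct per-language transformations.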

Quick Start & Requirements

  • Install dependencies: pip3 install -r requirements.txt
  • Datasets: Requires CSS10 (all languages) and cleaned Common Voice data.
  • Preprocessing: Spectrograms can be precomputed using python3 prepare_css_spectrograms.py.
  • Training: PYTHONIOENCODING=utf-8 python3 train.py --hyper_parameters generated_switching.json
  • Monitoring: Use TensorBoard: tensorboard --logdir logs --port 6666
  • Links: Interactive Demos, Paper

Highlighted Details

  • Implements three distinct encoder parameter-sharing strategies for multilingual TTS.
  • Supports code-switching and cross-language voice cloning.
  • Includes pre-trained models for download.
  • Provides synthesized samples comparing the three multilingual TTS models.
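One way to picture code-switching support is to give every input character its own language index, so the language conditioning can change mid-utterance. The sketch below is purely illustrative (indices, dimensions, and the embedding-table approach are assumptions, not the repository's API):

```python
import torch
import torch.nn as nn

# Hypothetical code-switched utterance: English followed by Czech.
text = "hello světe"
langs = [0] * 6 + [1] * 5        # 0 = English, 1 = Czech (illustrative)
assert len(langs) == len(text)   # one language index per character

# A per-character language embedding that the encoder could consume
# alongside the character embeddings (dimensions are made up).
lang_table = nn.Embedding(num_embeddings=2, embedding_dim=8)
per_char = lang_table(torch.tensor(langs))   # (len(text), 8)
```

Under this representation, cross-language voice cloning amounts to pairing one speaker embedding with language conditioning for a different language.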

Maintenance & Community

  • Primary contributor: Tomáš Nekvinda.
  • Associated paper: "One Model, Many Languages: Meta-Learning for Multilingual Text-to-Speech" (Interspeech 2020).

Licensing & Compatibility

  • Code: MIT License.
  • Data: CSS10 dataset is Apache License 2.0; Common Voice data is CC0.
  • The MIT-licensed code is compatible with commercial use and closed-source distribution.

Limitations & Caveats

The WaveRNN vocoder is a separate dependency; the README links to its repository. Training requires significant computational resources and dataset preparation.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 5 stars in the last 90 days
