Multilingual_Text_to_Speech  by Tomiinek

Tacotron 2 implementation for multilingual speech synthesis research

created 6 years ago
838 stars

Top 43.4% on sourcepulse

Project Summary

This repository provides an implementation of Tacotron 2 for multilingual text-to-speech (TTS) synthesis, supporting parameter sharing, code-switching, and voice cloning. It is targeted at researchers and developers working on advanced TTS systems who need to train models on multiple languages or handle mixed-language speech. The project explores several approaches to encoder parameter sharing, aiming to balance parameter efficiency with per-language flexibility.

How It Works

The core of the implementation is a Tacotron 2 architecture adapted for multilingual synthesis. It explores three encoder parameter-sharing strategies: full sharing with an adversarial classifier that removes speaker information from the encoder output; fully language-specific encoders; and a hybrid approach in which a parameter generator produces language-specific encoder parameters. The hybrid method, combined with domain adversarial training, aims for effective parameter sharing while retaining flexibility across languages and voices.
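The parameter-generator idea can be sketched as a small hypernetwork that maps a language embedding to the weights of one encoder layer. This is an illustrative PyTorch sketch, not the repository's implementation; the class name, shapes, and single-linear-layer target are assumptions for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearParameterGenerator(nn.Module):
    """Hypothetical sketch of the hybrid strategy: a generator network
    maps a language embedding to the weight matrix and bias of one
    language-specific linear layer of the encoder."""

    def __init__(self, lang_dim: int, in_features: int, out_features: int):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        # Emits a flattened weight matrix plus a bias vector.
        self.generator = nn.Linear(lang_dim,
                                   in_features * out_features + out_features)

    def forward(self, lang_embedding: torch.Tensor,
                x: torch.Tensor) -> torch.Tensor:
        params = self.generator(lang_embedding)       # (in*out + out,)
        split = self.in_features * self.out_features
        weight = params[:split].view(self.out_features, self.in_features)
        bias = params[split:]
        # Apply the generated, language-specific layer to the input.
        return F.linear(x, weight, bias)

# Usage with made-up dimensions: a 4-dim language embedding generates
# a 16 -> 32 linear layer, applied to a batch of 10 frames.
gen = LinearParameterGenerator(lang_dim=4, in_features=16, out_features=32)
y = gen(torch.randn(4), torch.randn(10, 16))
```

Because the encoder weights are a function of the language embedding, languages share the generator's parameters while still receiving distinct per-language transformations.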

Quick Start & Requirements

  • Install dependencies: pip3 install -r requirements.txt
  • Datasets: Requires CSS10 (all languages) and cleaned Common Voice data.
  • Preprocessing: Spectrograms can be precomputed using python3 prepare_css_spectrograms.py.
  • Training: PYTHONIOENCODING=utf-8 python3 train.py --hyper_parameters generated_switching.json
  • Monitoring: Use TensorBoard: tensorboard --logdir logs --port 6666
  • Links: Interactive Demos, Paper

Highlighted Details

  • Implements three distinct encoder parameter-sharing strategies for multilingual TTS.
  • Supports code-switching and cross-language voice cloning.
  • Includes pre-trained models for download.
  • Provides synthesized samples comparing the three multilingual TTS models.
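One way to picture code-switching support is to give every input character its own language index, so the language conditioning can change mid-utterance. The sketch below is purely illustrative (indices, dimensions, and the embedding-table approach are assumptions, not the repository's API):

```python
import torch
import torch.nn as nn

# Hypothetical code-switched utterance: English followed by Czech.
text = "hello světe"
langs = [0] * 6 + [1] * 5        # 0 = English, 1 = Czech (illustrative)
assert len(langs) == len(text)   # one language index per character

# A per-character language embedding that the encoder could consume
# alongside the character embeddings (dimensions are made up).
lang_table = nn.Embedding(num_embeddings=2, embedding_dim=8)
per_char = lang_table(torch.tensor(langs))   # (len(text), 8)
```

Under this representation, cross-language voice cloning amounts to pairing one speaker embedding with language conditioning for a different language.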

Maintenance & Community

  • Primary contributor: Tomáš Nekvinda.
  • Associated paper: "One Model, Many Languages: Meta-Learning for Multilingual Text-to-Speech" (Interspeech 2020).

Licensing & Compatibility

  • Code: MIT License.
  • Data: CSS10 dataset is Apache License 2.0; Common Voice data is CC0.
  • The MIT-licensed code is compatible with commercial use and closed-source distribution.

Limitations & Caveats

The WaveRNN vocoder is a separate dependency; the README links to its repository. Training requires significant computational resources and dataset preparation.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 5 stars in the last 90 days
