Multilingual_Text_to_Speech by Tomiinek

Tacotron 2 implementation for multilingual speech synthesis research

Created 6 years ago
838 stars

Top 42.5% on SourcePulse

Project Summary

This repository provides an implementation of Tacotron 2 for multilingual text-to-speech (TTS) synthesis, supporting parameter sharing, code-switching, and voice cloning. It is aimed at researchers and developers building advanced TTS systems who need to train models on multiple languages or handle mixed-language speech. The project explores several approaches to encoder parameter sharing, aiming to balance training efficiency against per-language flexibility.

How It Works

The core of the implementation is a Tacotron 2 architecture adapted for multilingual synthesis. It explores three parameter-sharing strategies for the encoder: full sharing with an adversarial speaker classifier that removes speaker-specific information, fully separate language-specific encoders, and a hybrid approach in which a parameter generator produces language-specific encoder parameters from a language embedding. The hybrid method, combined with domain adversarial training, aims to share parameters effectively across languages while retaining flexibility for different languages and voices.
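The hybrid strategy can be illustrated with a minimal NumPy sketch of contextual parameter generation: a shared generator maps each language embedding to the weights of an encoder layer, so languages share the generator but obtain distinct encoder parameters. All names and dimensions below are hypothetical and chosen only for illustration; they do not come from the repository.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (illustrative only).
lang_emb_dim, in_dim, out_dim = 4, 8, 8

# Shared parameter generator: maps a language embedding to the
# flattened weight matrix of one encoder linear layer.
gen_W = rng.standard_normal((lang_emb_dim, in_dim * out_dim)) * 0.01
gen_b = np.zeros(in_dim * out_dim)

def generate_encoder_weights(lang_emb):
    """Produce language-specific encoder weights from a language embedding."""
    flat = lang_emb @ gen_W + gen_b
    return flat.reshape(in_dim, out_dim)

# Two languages share the generator but get distinct encoder weights.
emb_en = rng.standard_normal(lang_emb_dim)
emb_de = rng.standard_normal(lang_emb_dim)

W_en = generate_encoder_weights(emb_en)
W_de = generate_encoder_weights(emb_de)

x = rng.standard_normal(in_dim)
h_en = np.tanh(x @ W_en)  # encoder forward pass with English-specific weights
h_de = np.tanh(x @ W_de)  # same input encoded with German-specific weights
```

Because only the small generator is language-conditioned, the bulk of the parameters is shared, which is the efficiency/flexibility trade-off the project targets.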

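The domain adversarial training mentioned above is typically implemented with a gradient reversal layer: the forward pass is the identity, but the gradient flowing back into the encoder is negated, so the encoder learns to confuse the speaker classifier. The following NumPy sketch shows the idea with a toy linear classifier and manually derived gradients; the setup and dimensions are hypothetical, not taken from the repository.

```python
import numpy as np

rng = np.random.default_rng(1)

def grad_reverse_backward(grad, lam=1.0):
    # Identity in the forward pass; sign-flipped gradient in the backward
    # pass, so the encoder is pushed to remove speaker information.
    return -lam * grad

# Toy setup: encoder output h, linear speaker classifier (W, b), 3 speakers.
h = rng.standard_normal(8)
W = rng.standard_normal((8, 3)) * 0.1
b = np.zeros(3)

logits = h @ W + b
probs = np.exp(logits - logits.max())
probs /= probs.sum()

speaker = 1                  # true speaker id
grad_logits = probs.copy()
grad_logits[speaker] -= 1.0  # d(cross-entropy)/d(logits) for softmax

grad_h = grad_logits @ W.T                  # gradient at the encoder output
grad_h_enc = grad_reverse_backward(grad_h)  # reversed before the encoder
```

The classifier itself still receives the true gradient and keeps improving at speaker identification, while the encoder receives the negated gradient and learns speaker-invariant representations.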
Quick Start & Requirements

  • Install dependencies: pip3 install -r requirements.txt
  • Datasets: Requires CSS10 (all languages) and cleaned Common Voice data.
  • Preprocessing: Spectrograms can be precomputed using python3 prepare_css_spectrograms.py.
  • Training: PYTHONIOENCODING=utf-8 python3 train.py --hyper_parameters generated_switching.json
  • Monitoring: Use TensorBoard: tensorboard --logdir logs --port 6666
  • Links: Interactive Demos, Paper

Highlighted Details

  • Implements three distinct encoder parameter-sharing strategies for multilingual TTS.
  • Supports code-switching and cross-language voice cloning.
  • Includes pre-trained models for download.
  • Provides synthesized samples comparing the three multilingual TTS models.

Maintenance & Community

  • Primary contributor: Tomáš Nekvinda.
  • Associated paper: "One Model, Many Languages: Meta-Learning for Multilingual Text-to-Speech" (Interspeech 2020).

Licensing & Compatibility

  • Code: MIT License.
  • Data: CSS10 dataset is Apache License 2.0; Common Voice data is CC0.
  • The MIT-licensed code is compatible with commercial use and closed-source integration.

Limitations & Caveats

The WaveRNN vocoder is a separate dependency; the README links to its repository rather than bundling it. Training requires significant computational resources and substantial dataset preparation.

Health Check
Last Commit

1 year ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
0
Star History
1 star in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering", "Designing Machine Learning Systems"), Pietro Schirano (founder of MagicPath), and 2 more.

metavoice-src by metavoiceio

0.1%
4k
TTS model for human-like, expressive speech
Created 1 year ago
Updated 1 year ago
Starred by Chip Huyen (author of "AI Engineering", "Designing Machine Learning Systems"), Chaoyu Yang (founder of Bento), and 1 more.

fish-speech by fishaudio

0.3%
23k
Open-source TTS for multilingual speech synthesis
Created 1 year ago
Updated 1 week ago
Starred by Chip Huyen (author of "AI Engineering", "Designing Machine Learning Systems"), Junyang Lin (core maintainer at Alibaba Qwen), and 6 more.

OpenVoice by myshell-ai

0.2%
34k
Audio foundation model for versatile, instant voice cloning
Created 1 year ago
Updated 5 months ago
Starred by Georgios Konstantopoulos (CTO, general partner at Paradigm) and Chip Huyen (author of "AI Engineering", "Designing Machine Learning Systems").

GPT-SoVITS by RVC-Boss

0.3%
51k
Few-shot voice cloning and TTS web UI
Created 1 year ago
Updated 1 week ago