dl-for-emo-tts by Emotional-Text-to-Speech

Deep learning approaches for emotional text-to-speech

Created 5 years ago
455 stars

Top 66.5% on SourcePulse

Project Summary

This repository explores deep learning approaches for emotional Text-to-Speech (TTS) synthesis, targeting researchers and practitioners in speech synthesis. It details experimental findings from fine-tuning Tacotron and DC-TTS models on emotional speech datasets, offering insights into effective strategies for low-resource emotional TTS.

How It Works

The project investigates fine-tuning pre-trained Tacotron and DC-TTS models on emotional speech datasets like RAVDESS and EMOV-DB. Key strategies include adjusting learning rates, switching optimizers (Adam to SGD), freezing specific model components (encoder, postnet), and using single-speaker data per emotion. These methods aim to mitigate "catastrophic forgetting" and improve emotional expressiveness in synthesized speech.
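
The sketch below illustrates the two mitigation strategies in plain PyTorch. It is a minimal illustration, not the repository's training code: the stand-in model class and its attribute names are assumptions (the actual names differ between the r9y9 and tugstugi forks).

```python
import torch
import torch.nn as nn

# Stand-in for a pre-trained Tacotron-like model with encoder/postnet
# submodules (attribute names are illustrative, not the forks' actual ones).
class TinyTacotron(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(256, 256)
        self.decoder = nn.GRUCell(256, 256)
        self.postnet = nn.Linear(256, 80)

model = TinyTacotron()  # in practice, load the LJ Speech pre-trained weights

# Strategy 1: freeze the encoder and post-net so only the decoder adapts
# to the small emotional dataset.
for module in (model.encoder, model.postnet):
    for p in module.parameters():
        p.requires_grad = False

# Strategy 2: swap Adam for SGD at a low learning rate; slower weight
# drift helps mitigate catastrophic forgetting during fine-tuning.
optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=1e-5
)
```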

Quick Start & Requirements

  • Installation: Requires PyTorch. Specific training scripts are provided for each approach.
  • Dependencies: Python, PyTorch, librosa.
  • Data: Requires datasets such as RAVDESS, EMOV-DB, and LJ Speech; pre-trained Tacotron and DC-TTS models on LJ Speech are available (a minimal preprocessing sketch follows this list).
  • Demo: A Colab notebook is provided for demonstration.
  • Code: Modified forks of r9y9/tacotron and tugstugi/dc-tts are used.
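
As referenced in the Data bullet, here is a minimal data-preparation sketch using the listed librosa dependency. The file path and frame parameters are illustrative assumptions, not values taken from the repository's scripts.

```python
import librosa

# Hypothetical EMOV-DB file path; substitute a real utterance after
# downloading the dataset.
wav, sr = librosa.load("EMOV-DB/bea/Amused/amused_0001.wav", sr=22050)

# Mel spectrogram of the kind Tacotron/DC-TTS acoustic models consume.
# These frame parameters are common defaults, not the repo's exact config.
mel = librosa.feature.melspectrogram(
    y=wav, sr=sr, n_fft=1024, hop_length=256, n_mels=80
)
log_mel = librosa.power_to_db(mel)
print(log_mel.shape)  # (80, n_frames)
```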

Highlighted Details

  • Fine-tuning Tacotron with a frozen post-net and low learning rate on EMOV-DB yielded significantly improved results for emotions like "Disgust," "Sleepiness," and "Amused."
  • Approach 8, which replicates a preprint using EMOV-DB with a single female speaker per emotion, top_db=20, and monotonic_attention=True, successfully generated "Anger" with good quality (see the trimming sketch after this list).
  • The project systematically documents failures and successes across multiple fine-tuning strategies, providing valuable empirical data.
  • Datasets like RAVDESS and EMOV-DB are analyzed for their pros and cons regarding emotional expressiveness and data limitations.
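
To make the top_db=20 detail from Approach 8 concrete, the sketch below shows how such silence trimming can be done with librosa.effects.trim. The file path is hypothetical.

```python
import librosa

# top_db=20 treats anything more than 20 dB below the signal's peak as
# silence, stripping the long leading/trailing pauses common in EMOV-DB
# recordings.
wav, sr = librosa.load("EMOV-DB/bea/Anger/anger_0001.wav", sr=22050)
trimmed, (start, end) = librosa.effects.trim(wav, top_db=20)
print(f"kept samples {start}..{end} of {len(wav)}")
```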

Maintenance & Community

The project was released in June 2020 by a team of authors from IIIT Delhi. Contact information for project members is provided for support.

Licensing & Compatibility

The repository is licensed under the MIT License, allowing for commercial use and modification.

Limitations & Caveats

  • Some approaches, particularly early Tacotron fine-tuning attempts, resulted in unintelligible speech or complete failure.
  • Certain emotions (e.g., "Disgust," "Amused," "Sleepiness" in Approach 8) remained challenging due to non-verbal cues in the audio or subtle perceptual differences.
  • The README indicates that the DC-TTS Text2Mel module fine-tuning (Approach 7) resulted in blank spectrograms and no audio output.
Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 1 star in the last 30 days
