dl-for-emo-tts by Emotional-Text-to-Speech

Deep learning approaches for emotional text-to-speech

Created 5 years ago
455 stars

Top 66.5% on SourcePulse

Project Summary

This repository explores deep learning approaches for emotional Text-to-Speech (TTS) synthesis, targeting researchers and practitioners in speech synthesis. It details experimental findings from fine-tuning Tacotron and DC-TTS models on emotional speech datasets, offering insights into effective strategies for low-resource emotional TTS.

How It Works

The project investigates fine-tuning pre-trained Tacotron and DC-TTS models on emotional speech datasets like RAVDESS and EMOV-DB. Key strategies include adjusting learning rates, switching optimizers (Adam to SGD), freezing specific model components (encoder, postnet), and using single-speaker data per emotion. These methods aim to mitigate "catastrophic forgetting" and improve emotional expressiveness in synthesized speech.
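
The sketch below illustrates the two mitigation strategies in plain PyTorch. It is a minimal illustration, not the repository's training code: the stand-in model class and its attribute names are assumptions (the actual names differ between the r9y9 and tugstugi forks).

```python
import torch
import torch.nn as nn

# Stand-in for a pre-trained Tacotron-like model with encoder/postnet
# submodules (attribute names are illustrative, not the forks' actual ones).
class TinyTacotron(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(256, 256)
        self.decoder = nn.GRUCell(256, 256)
        self.postnet = nn.Linear(256, 80)

model = TinyTacotron()  # in practice, load the LJ Speech pre-trained weights

# Strategy 1: freeze the encoder and post-net so only the decoder adapts
# to the small emotional dataset.
for module in (model.encoder, model.postnet):
    for p in module.parameters():
        p.requires_grad = False

# Strategy 2: swap Adam for SGD at a low learning rate; slower weight
# drift helps mitigate catastrophic forgetting during fine-tuning.
optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=1e-5
)
```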

Quick Start & Requirements

  • Installation: Requires PyTorch. Specific training scripts are provided for each approach.
  • Dependencies: Python, PyTorch, librosa.
  • Data: Requires datasets such as RAVDESS, EMOV-DB, and LJ Speech; pre-trained Tacotron and DC-TTS models on LJ Speech are available (a minimal preprocessing sketch follows this list).
  • Demo: A Colab notebook is provided for demonstration.
  • Code: Modified forks of r9y9/tacotron and tugstugi/dc-tts are used.
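
As referenced in the Data bullet, here is a minimal data-preparation sketch using the listed librosa dependency. The file path and frame parameters are illustrative assumptions, not values taken from the repository's scripts.

```python
import librosa

# Hypothetical EMOV-DB file path; substitute a real utterance after
# downloading the dataset.
wav, sr = librosa.load("EMOV-DB/bea/Amused/amused_0001.wav", sr=22050)

# Mel spectrogram of the kind Tacotron/DC-TTS acoustic models consume.
# These frame parameters are common defaults, not the repo's exact config.
mel = librosa.feature.melspectrogram(
    y=wav, sr=sr, n_fft=1024, hop_length=256, n_mels=80
)
log_mel = librosa.power_to_db(mel)
print(log_mel.shape)  # (80, n_frames)
```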

Highlighted Details

  • Fine-tuning Tacotron with a frozen post-net and low learning rate on EMOV-DB yielded significantly improved results for emotions like "Disgust," "Sleepiness," and "Amused."
  • Approach 8, which replicates a preprint using EMOV-DB with a single female speaker per emotion, top_db=20, and monotonic_attention=True, successfully generated "Anger" with good quality (see the trimming sketch after this list).
  • The project systematically documents failures and successes across multiple fine-tuning strategies, providing valuable empirical data.
  • Datasets like RAVDESS and EMOV-DB are analyzed for their pros and cons regarding emotional expressiveness and data limitations.
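
To make the top_db=20 detail from Approach 8 concrete, the sketch below shows how such silence trimming can be done with librosa.effects.trim. The file path is hypothetical.

```python
import librosa

# top_db=20 treats anything more than 20 dB below the signal's peak as
# silence, stripping the long leading/trailing pauses common in EMOV-DB
# recordings.
wav, sr = librosa.load("EMOV-DB/bea/Anger/anger_0001.wav", sr=22050)
trimmed, (start, end) = librosa.effects.trim(wav, top_db=20)
print(f"kept samples {start}..{end} of {len(wav)}")
```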

Maintenance & Community

The project was released in June 2020 by a team of authors from IIIT Delhi. Contact information for project members is provided for support.

Licensing & Compatibility

The repository is licensed under the MIT License, allowing for commercial use and modification.

Limitations & Caveats

  • Some approaches, particularly early Tacotron fine-tuning attempts, resulted in unintelligible speech or complete failure.
  • Certain emotions (e.g., "Disgust," "Amused," "Sleepiness" in Approach 8) remained challenging due to non-verbal cues in the audio or subtle perceptual differences.
  • The README indicates that the DC-TTS Text2Mel module fine-tuning (Approach 7) resulted in blank spectrograms and no audio output.
Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 1 star in the last 30 days
