dl-for-emo-tts by Emotional-Text-to-Speech

Deep learning approaches for emotional text-to-speech

Created 5 years ago · 460 stars · Top 65.8% on SourcePulse

Project Summary

This repository explores deep learning approaches for emotional Text-to-Speech (TTS) synthesis, targeting researchers and practitioners in speech synthesis. It details experimental findings from fine-tuning Tacotron and DC-TTS models on emotional speech datasets, offering insights into effective strategies for low-resource emotional TTS.

How It Works

The project investigates fine-tuning pre-trained Tacotron and DC-TTS models on emotional speech datasets like RAVDESS and EMOV-DB. Key strategies include adjusting learning rates, switching optimizers (Adam to SGD), freezing specific model components (encoder, postnet), and using single-speaker data per emotion. These methods aim to mitigate "catastrophic forgetting" and improve emotional expressiveness in synthesized speech.
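
A minimal PyTorch sketch of the freezing-and-SGD recipe described above; the toy model and its submodule names (`encoder`, `postnet`) are placeholders for illustration, not the actual classes from the repo's Tacotron fork:

```python
import torch
import torch.nn as nn

# Toy stand-in for a Tacotron-style model; the real submodules are far larger.
class ToyTTS(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(256, 256)  # stands in for the text encoder
        self.decoder = nn.Linear(256, 80)   # stands in for the mel decoder
        self.postnet = nn.Linear(80, 80)    # stands in for the post-net

def freeze(module: nn.Module) -> None:
    """Exclude a submodule's weights from gradient updates."""
    for p in module.parameters():
        p.requires_grad = False

model = ToyTTS()
freeze(model.postnet)  # e.g. keep the post-net learned on LJ Speech intact

# Fine-tune only the remaining weights with SGD at a low learning rate
# (rather than Adam) to limit catastrophic forgetting on small datasets.
optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=1e-5
)
```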

Quick Start & Requirements

  • Installation: requires PyTorch; separate training scripts are provided for each fine-tuning approach.
  • Dependencies: Python, PyTorch, librosa (a quick import check follows this list).
  • Data: Requires datasets like RAVDESS, EMOV-DB, and LJ Speech. Pre-trained models for Tacotron and DC-TTS on LJ Speech are available.
  • Demo: A Colab notebook is provided for demonstration.
  • Code: Modified forks of r9y9/tacotron and tugstugi/dc-tts are used.
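
A quick import check before training, assuming only the dependencies listed above are needed (the forks may pin specific versions; consult their requirements files):

```python
# Sanity-check that the core dependencies are importable before training.
import torch
import librosa

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("librosa:", librosa.__version__)
```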

Highlighted Details

  • Fine-tuning Tacotron with a frozen post-net and low learning rate on EMOV-DB yielded significantly improved results for emotions like "Disgust," "Sleepiness," and "Amused."
  • Approach 8, replicating a preprint using EMOV-DB with a single female speaker per emotion, top_db=20, and monotonic_attention=True (see the trimming sketch after this list), successfully generated "Anger" with good quality.
  • The project systematically documents failures and successes across multiple fine-tuning strategies, providing valuable empirical data.
  • Datasets like RAVDESS and EMOV-DB are analyzed for their pros and cons regarding emotional expressiveness and data limitations.
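
A small sketch of the top_db=20 silence trimming mentioned in Approach 8, using librosa's standard trim API (the filename is a placeholder, not a file from EMOV-DB):

```python
import librosa

# top_db=20 trims leading/trailing audio more aggressively than librosa's
# default of 60 dB, which helps drop quiet non-verbal segments.
y, sr = librosa.load("anger_sample.wav", sr=22050)  # placeholder filename
y_trimmed, interval = librosa.effects.trim(y, top_db=20)
print(f"kept samples {interval[0]}..{interval[1]} of {len(y)}")
```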

Maintenance & Community

The project was released in June 2020 by a team of authors from IIIT Delhi. Contact information for project members is provided for support.

Licensing & Compatibility

The repository is licensed under the MIT License, allowing for commercial use and modification.

Limitations & Caveats

  • Some approaches, particularly early Tacotron fine-tuning attempts, resulted in unintelligible speech or complete failure.
  • Certain emotions (e.g., "Disgust," "Amused," "Sleepiness" in Approach 8) remained challenging due to non-verbal cues in the audio or subtle perceptual differences.
  • The README indicates that the DC-TTS Text2Mel module fine-tuning (Approach 7) resulted in blank spectrograms and no audio output.
Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 1 star in the last 30 days

Explore Similar Projects

Orpheus-TTS by canopyai

Open-source TTS for human-sounding speech, built on Llama-3b
Top 0.2% · 6k stars · Created 10 months ago · Updated 1 month ago
Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Michael Han (cofounder of Unsloth), and 1 more.

StyleTTS2 by yl4579

Text-to-speech model achieving human-level synthesis
Top 0.2% · 6k stars · Created 2 years ago · Updated 1 year ago
Starred by Tim J. Baek (founder of Open WebUI), Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), and 4 more.