DailyTalk by keonlee9420

Conversational TTS dataset and baseline for dialogue synthesis

Created 3 years ago
250 stars

Summary

DailyTalk provides a high-quality spoken dialogue dataset and baseline code for conversational Text-to-Speech (TTS). It addresses the lack of conversational context in existing TTS datasets, so that researchers and developers can build more natural, context-aware speech synthesis.

How It Works

The dataset is derived from DailyDialog, a text dialogue corpus: dialogues are sampled, modified, and recorded as speech with attention to audio quality. The baseline is a non-autoregressive TTS model conditioned on historical dialogue information, that is, on the turns preceding the utterance being synthesized. This lets the model use conversational context, which distinguishes it from utterance-centric TTS systems that treat each sentence in isolation.
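
To make the idea of conditioning on dialogue history concrete, the following is a minimal sketch of how the preceding turns might be collected for each utterance. It is not the repository's implementation: the `Turn` class, the `history_windows` function, and the `max_history` parameter are hypothetical names introduced here for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    """One utterance in a dialogue (hypothetical structure for illustration)."""
    speaker: str
    text: str


def history_windows(dialogue: List[Turn], max_history: int) -> List[List[Turn]]:
    """For each turn, return up to `max_history` preceding turns, oldest first."""
    windows = []
    for i in range(len(dialogue)):
        start = max(0, i - max_history)
        windows.append(dialogue[start:i])
    return windows


dialogue = [
    Turn("A", "Hi, how are you?"),
    Turn("B", "Fine, thanks. And you?"),
    Turn("A", "Pretty good."),
]

# Each turn is paired with the context a history-conditioned model would see.
for turn, history in zip(dialogue, history_windows(dialogue, max_history=2)):
    print(repr(turn.text), "<-", [h.text for h in history])
```

The first turn has an empty history, which is why a context-conditioned model has to handle dialogues turn by turn rather than as independent sentences.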

Quick Start & Requirements

  • Installation: pip3 install -r requirements.txt or via Dockerfile.
  • Prerequisites: Download the dataset and the pretrained models (place the models in output/ckpt/DailyTalk/), and unzip the HiFi-GAN vocoder models. For multi-speaker training, a DeepSpeaker model may be required. Pre-extracted Montreal Forced Aligner (MFA) alignments are provided; alternatively, alignments can be generated with MFA.
  • Inference: Supports batch inference via python3 synthesize.py --source preprocessed_data/DailyTalk/val_*.txt --restore_step RESTORE_STEP --mode batch --dataset DailyTalk.
  • Training: Requires preprocessing (python3 prepare_align.py, python3 preprocess.py) followed by training (python3 train.py).
  • Links: Download links for the dataset and pretrained models are available from the project's GitHub repository.
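
The commands listed above can be assembled programmatically, for example to script an experiment. The sketch below only builds the argument lists exactly as quoted in this summary; the helper names `synthesize_cmd` and `training_cmds` are hypothetical, the restore step value is a placeholder, and the scripts may accept or require further arguments that this summary does not list.

```python
from typing import List


def synthesize_cmd(restore_step: int,
                   source: str = "preprocessed_data/DailyTalk/val_*.txt") -> List[str]:
    """Batch-inference command as quoted in the summary above."""
    return [
        "python3", "synthesize.py",
        "--source", source,
        "--restore_step", str(restore_step),
        "--mode", "batch",
        "--dataset", "DailyTalk",
    ]


def training_cmds() -> List[List[str]]:
    """Preprocessing followed by training, in the order given above."""
    return [
        ["python3", "prepare_align.py"],
        ["python3", "preprocess.py"],
        ["python3", "train.py"],
    ]


# 1000 is a placeholder; use the step of the checkpoint you actually have.
for cmd in training_cmds() + [synthesize_cmd(restore_step=1000)]:
    print(" ".join(cmd))
```

Note that the `val_*.txt` pattern is a shell glob: if these lists are passed to `subprocess.run` without a shell, the pattern is not expanded automatically and would need to be resolved first, for example with `glob.glob`.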

Highlighted Details

  • Employs unsupervised duration modeling with StyleSpeech's convolutional embedding for phoneme-level variance.
  • Offers bucket-based embedding (FastSpeech2) as an alternative.
  • Utilizes the HiFi-GAN vocoder for all experiments.
  • Baseline TTS model leverages historical dialogue context for improved synthesis.
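
As an illustration of what a bucket-based embedding does, the sketch below quantizes a continuous value (such as pitch or energy) into one of a fixed number of bins and looks up a vector for that bin. This is a generic, pure-Python rendering of the technique under simplifying assumptions, not code from the repository: the class name `BucketEmbedding`, the value range, and the random (untrained) table are all hypothetical.

```python
import bisect
import random
from typing import List


def make_boundaries(lo: float, hi: float, n_bins: int) -> List[float]:
    """Return n_bins - 1 evenly spaced boundaries from lo to hi (needs n_bins >= 3)."""
    step = (hi - lo) / (n_bins - 2)
    return [lo + i * step for i in range(n_bins - 1)]


class BucketEmbedding:
    """Maps a continuous value to a bin index, then to that bin's vector."""

    def __init__(self, lo: float, hi: float, n_bins: int, dim: int, seed: int = 0):
        rng = random.Random(seed)
        self.boundaries = make_boundaries(lo, hi, n_bins)
        # One vector per bin; in a real model these would be learned parameters.
        self.table = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bins)]

    def index(self, value: float) -> int:
        # Values below the first boundary fall in bin 0, above the last in bin n_bins - 1.
        return bisect.bisect_left(self.boundaries, value)

    def __call__(self, value: float) -> List[float]:
        return self.table[self.index(value)]


emb = BucketEmbedding(lo=0.0, hi=1.0, n_bins=5, dim=4)
for v in (-0.5, 0.5, 2.0):
    print(v, "-> bin", emb.index(v))
```

All values that land in the same bin share one vector, which is the main contrast with a convolutional embedding that operates on the continuous value directly.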

Maintenance & Community

No specific details regarding maintainers, community channels (e.g., Discord, Slack), or roadmaps were found in the provided README content.

Licensing & Compatibility

Licensed under Creative Commons Attribution-ShareAlike 4.0 International (CC-BY-SA 4.0). This license permits commercial use and modification, provided that attribution is given and derivative works are shared under the same terms.

Limitations & Caveats

The system currently supports only batch inference, because synthesizing a turn relies on the conversational history that precedes it. Pretrained models may not have been trained using supervised duration modeling or external speaker embedders.

Health Check

  • Last commit: 7 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 0 stars in the last 30 days
