DailyTalk by keonlee9420

Conversational TTS dataset and baseline for dialogue synthesis

Created 3 years ago

252 stars

Top 99.6% on SourcePulse

Project Summary

DailyTalk introduces a high-quality spoken dialogue dataset and baseline code for conversational Text-to-Speech (TTS). It addresses the deficiency of conversational context in existing TTS datasets, enabling more natural and context-aware speech synthesis for researchers and developers.

How It Works

The dataset is derived from the text dialogue corpus DailyDialog: dialogues were sampled, modified, and recorded with attention to speech quality. The baseline is a non-autoregressive TTS model conditioned on the dialogue history. This conditioning lets the model use conversational context, which distinguishes it from TTS systems that synthesize each utterance in isolation.
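To make the idea of history conditioning concrete, the sketch below pairs each turn of a dialogue with the turns that precede it. It is an illustration only: the `Turn` type, the `history_windows` helper, and the `max_history` parameter are hypothetical names, and the summary does not say how the baseline actually encodes or truncates its history.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Turn:
    """One utterance in a dialogue (hypothetical structure for illustration)."""
    speaker: int
    text: str


def history_windows(
    dialogue: List[Turn], max_history: int
) -> List[Tuple[List[Turn], Turn]]:
    """Pair every turn with up to `max_history` preceding turns.

    `max_history` is an assumed knob; the real baseline may use a
    different window or a different way of summarizing the context.
    """
    if max_history < 0:
        raise ValueError("max_history must be non-negative")
    pairs = []
    for i, turn in enumerate(dialogue):
        start = max(0, i - max_history)
        # The context is strictly the turns before the current one.
        pairs.append((dialogue[start:i], turn))
    return pairs


dialogue = [
    Turn(0, "Hi, how are you?"),
    Turn(1, "Fine, thanks. And you?"),
    Turn(0, "Pretty good."),
    Turn(1, "Glad to hear it."),
]
windows = history_windows(dialogue, max_history=2)
```

Here the first turn gets an empty context, while the last turn is paired with the two turns before it. A utterance-centric system would correspond to `max_history=0`.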

Quick Start & Requirements

Installation: pip3 install -r requirements.txt or via Dockerfile.
Prerequisites: Download the dataset and the pretrained models (place the models in output/ckpt/DailyTalk/), and unzip the HiFi-GAN vocoder models. For multi-speaker training, a DeepSpeaker model may be required. Alignments pre-extracted with Montreal Forced Aligner (MFA) are provided, or they can be generated.
Inference: Supports batch inference via python3 synthesize.py --source preprocessed_data/DailyTalk/val_*.txt --restore_step RESTORE_STEP --mode batch --dataset DailyTalk.
Training: Requires preprocessing (python3 prepare_align.py, python3 preprocess.py) followed by training (python3 train.py).
Links: The project README links to the dataset download and the pretrained models.
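The commands above can be assembled programmatically, for example to fill in the RESTORE_STEP placeholder from a script. The following is a minimal sketch: `synthesize_command` and `TRAINING_STEPS` are hypothetical helper names, only the scripts and flags quoted above are used, and the scripts may need further arguments that this summary does not show.

```python
from typing import List


def synthesize_command(
    restore_step: int,
    source: str = "preprocessed_data/DailyTalk/val_*.txt",
    dataset: str = "DailyTalk",
) -> List[str]:
    """Build the batch-inference command line quoted in the quick start.

    Only flags that appear in the summary are emitted. The glob in the
    default `source` is passed through unchanged; expanding it is left
    to the shell or to the caller.
    """
    if restore_step < 0:
        raise ValueError("restore_step must be non-negative")
    return [
        "python3", "synthesize.py",
        "--source", source,
        "--restore_step", str(restore_step),
        "--mode", "batch",
        "--dataset", dataset,
    ]


# Preprocessing and training steps in the order the summary lists them.
# Assumption: extra arguments may be required by the actual scripts.
TRAINING_STEPS: List[List[str]] = [
    ["python3", "prepare_align.py"],
    ["python3", "preprocess.py"],
    ["python3", "train.py"],
]

# The step value 1000 is an arbitrary example, not a known checkpoint.
cmd = synthesize_command(1000)
print(" ".join(cmd))
```

The helper only builds argument lists and does not run anything, so it can be inspected or logged before being handed to a process launcher.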

Highlighted Details

Employs unsupervised duration modeling with StyleSpeech's convolutional embedding for phoneme-level variance.
Offers bucket-based embedding (FastSpeech2) as an alternative.
Utilizes the HiFi-GAN vocoder for all experiments.
Baseline TTS model leverages historical dialogue context for improved synthesis.
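As a rough illustration of what a bucket-based embedding in the FastSpeech2 style does, the sketch below quantizes a continuous variance value (such as pitch or energy) into one of a fixed number of buckets and looks up a vector for that bucket. It uses plain Python rather than the project's code: `make_boundaries`, `bucket_index`, `BucketEmbedding`, the value range, and the sizes are all assumptions chosen for the example, and the randomly initialized table stands in for parameters that would be learned in training.

```python
import bisect
import random
from typing import List


def make_boundaries(lo: float, hi: float, n_bins: int) -> List[float]:
    """Return the n_bins - 1 interior boundaries that split [lo, hi] evenly."""
    if n_bins < 2 or hi <= lo:
        raise ValueError("need n_bins >= 2 and hi > lo")
    return [lo + (hi - lo) * k / n_bins for k in range(1, n_bins)]


def bucket_index(value: float, boundaries: List[float]) -> int:
    """Map a continuous value to a bucket id in [0, len(boundaries)].

    Values below the first boundary fall into bucket 0 and values above
    the last boundary fall into the final bucket.
    """
    return bisect.bisect_left(boundaries, value)


class BucketEmbedding:
    """Lookup table with one vector per bucket (random stand-in weights)."""

    def __init__(self, lo: float, hi: float, n_bins: int, dim: int, seed: int = 0):
        rng = random.Random(seed)
        self.boundaries = make_boundaries(lo, hi, n_bins)
        self.table = [
            [rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n_bins)
        ]

    def __call__(self, value: float) -> List[float]:
        return self.table[bucket_index(value, self.boundaries)]


# Hypothetical range and sizes, chosen only to keep the example small.
embed = BucketEmbedding(lo=0.0, hi=4.0, n_bins=4, dim=8)
vector = embed(1.5)
```

Two values that land in the same bucket receive the same vector, which is the main contrast with a convolutional embedding that operates on the continuous value directly.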

Maintenance & Community

No specific details regarding maintainers, community channels (e.g., Discord, Slack), or roadmaps were found in the provided README content.

Licensing & Compatibility

Licensed under Creative Commons Attribution-ShareAlike 4.0 International (CC-BY-SA 4.0). This license permits commercial use and modification, provided that attribution is given and derivative works are shared under the same terms.

Limitations & Caveats

The system currently supports only batch inference, because synthesizing a turn relies on the conversational history. Pretrained models may not have been trained using supervised duration modeling or external speaker embedders.
