DMOSpeech2 by yl4579

Metric-optimized speech synthesis with RL

Created 1 month ago
280 stars

Top 93.0% on SourcePulse

Project Summary

DMOSpeech 2 addresses the challenge of optimizing all components in diffusion-based text-to-speech (TTS) systems for perceptual metrics, specifically focusing on duration prediction. It extends prior metric optimization work by employing reinforcement learning for duration prediction, targeting researchers and developers in speech synthesis seeking improved quality and efficiency.

How It Works

This system introduces a novel duration policy framework using Group Relative Preference Optimization (GRPO) to optimize the duration predictor. It leverages speaker similarity and word error rate as reward signals, aiming for a more complete metric-optimized synthesis pipeline. Additionally, it incorporates teacher-guided sampling, a hybrid approach that uses a teacher model for initial denoising steps before transitioning to a student model, enhancing output diversity while maintaining efficiency.
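The teacher-guided sampling idea can be sketched as a simple hybrid loop: the teacher model handles the first few denoising steps, then the distilled student takes over. This is a minimal illustration, not the repository's implementation; the function names, step counts, and switch point are placeholders.

```python
def teacher_guided_sample(x, teacher_step, student_step, total_steps, switch_at):
    """Hybrid sampling: run the teacher for the first `switch_at` denoising
    steps, then hand off to the distilled student for the remainder.
    All callables and step counts here are illustrative placeholders."""
    for t in range(total_steps):
        step = teacher_step if t < switch_at else student_step
        x = step(x, t)
    return x

# Toy "denoisers" that just record which model handled each step.
trace = []
teacher = lambda x, t: trace.append(("teacher", t)) or x
student = lambda x, t: trace.append(("student", t)) or x
out = teacher_guided_sample(0.0, teacher, student, total_steps=8, switch_at=3)
```

The early teacher steps inject diversity into the trajectory, while the student finishes cheaply, which is how the hybrid keeps efficiency.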

Quick Start & Requirements

  • Install: Create a Python 3.10 environment (conda create -n dmo2 python=3.10), activate it (conda activate dmo2), clone the repo, and install requirements (pip install -r requirements.txt).
  • Prerequisites: Python 3.10, Conda.
  • Inference: Download checkpoints (model_1500.pt, model_85000.pt) from Hugging Face into a ckpts folder, then run demo.ipynb.
  • Links: DMOSpeech 2 GitHub
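The install steps above, collected into one shell session. The environment name, Python version, and requirements file come from the README; the repository URL is an assumption based on the author and project name.

```shell
# Create and activate the Python 3.10 environment (names from the README)
conda create -n dmo2 python=3.10
conda activate dmo2

# Clone the repository; URL assumed from the author/project name
git clone https://github.com/yl4579/DMOSpeech2.git
cd DMOSpeech2

# Install Python dependencies
pip install -r requirements.txt
```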

Highlighted Details

  • Reinforcement learning for duration prediction using GRPO.
  • Reward signals include speaker similarity and word error rate.
  • Teacher-guided sampling for improved diversity and efficiency.
  • Claims superior performance across metrics while halving the number of sampling steps.
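The GRPO-style training signal described above can be illustrated with a toy example: compute a reward per sampled duration from speaker similarity and word error rate, then normalize rewards within the group. The reward weighting and all numbers here are hypothetical; DMOSpeech 2's actual reward formulation may differ.

```python
import statistics

def reward(speaker_sim: float, wer: float) -> float:
    """Toy reward: reward high speaker similarity, penalize word error rate.
    The actual weighting used by DMOSpeech 2 is not specified here."""
    return speaker_sim - wer

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: normalize each reward against its own group,
    so no learned value function (critic) is needed."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

# A group of duration samples for the same text, with toy metric scores
# given as (speaker_similarity, word_error_rate) pairs.
samples = [(0.92, 0.05), (0.88, 0.12), (0.95, 0.03), (0.80, 0.20)]
rewards = [reward(sim, wer) for sim, wer in samples]
advantages = group_relative_advantages(rewards)
```

Samples that beat their group's average get positive advantages and are reinforced; the group-relative normalization is what distinguishes GRPO from standard policy-gradient setups with a critic.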

Maintenance & Community

The project is a collaboration with Newsbreak. TODO items indicate ongoing development for training code and fine-tuning.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. The project is modified from the F5-TTS repo, which may have its own licensing implications.

Limitations & Caveats

The training code is under construction and not yet tested. Vocoder fine-tuning or HiFTNet training is pending for higher acoustic quality. Streaming/concatenating inference is also a future TODO.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 14 stars in the last 30 days

Explore Similar Projects

Starred by Tim J. Baek (Founder of Open WebUI), Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems"), and 4 more.

StyleTTS2 by yl4579

  • Top 0.2% on SourcePulse, 6k stars
  • Text-to-speech model achieving human-level synthesis
  • Created 2 years ago, updated 1 year ago