Metric-optimized speech synthesis with RL
DMOSpeech 2 addresses the challenge of optimizing every component of a diffusion-based text-to-speech (TTS) system for perceptual metrics, focusing on duration prediction. It extends prior metric-optimization work by applying reinforcement learning to the duration predictor. It is aimed at researchers and developers in speech synthesis who want improved quality and efficiency.
How It Works
This system introduces a novel duration policy framework using Group Relative Preference Optimization (GRPO) to optimize the duration predictor. It leverages speaker similarity and word error rate as reward signals, aiming for a more complete metric-optimized synthesis pipeline. Additionally, it incorporates teacher-guided sampling, a hybrid approach that uses a teacher model for initial denoising steps before transitioning to a student model, enhancing output diversity while maintaining efficiency.
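The two ideas above can be illustrated with a minimal Python sketch. It is not the project's implementation: the reward weighting in `combined_reward`, the within-group mean/std normalization in `group_relative_advantages` (the form commonly used for GRPO-style methods), and the `teacher_step` / `student_step` callables are all assumptions introduced here for illustration.

```python
import math
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def combined_reward(speaker_sim: float, wer: float, wer_weight: float = 1.0) -> float:
    """Hypothetical scalar reward: reward speaker similarity, penalize word error rate."""
    return speaker_sim - wer_weight * wer


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of sampled durations.

    Each candidate's advantage is its reward minus the group mean,
    divided by the group standard deviation (assumed formulation).
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]


def teacher_guided_sample(
    x: T,
    teacher_step: Callable[[T], T],
    student_step: Callable[[T], T],
    n_teacher_steps: int,
    n_student_steps: int,
) -> T:
    """Run the first denoising steps with the teacher, then hand off to the student."""
    for _ in range(n_teacher_steps):
        x = teacher_step(x)
    for _ in range(n_student_steps):
        x = student_step(x)
    return x


# Three hypothetical duration candidates for the same text prompt,
# each scored as (speaker similarity, word error rate).
scores = [(0.70, 0.05), (0.60, 0.20), (0.80, 0.02)]
rewards = [combined_reward(sim, wer) for sim, wer in scores]
advantages = group_relative_advantages(rewards)
```

In this sketch, candidates with above-average reward get positive advantages, so a policy update would make those durations more likely. The sampling helper only shows the hand-off order between teacher and student; the step functions stand in for real model calls.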
Quick Start & Requirements
Create a conda environment (conda create -n dmo2 python=3.10), activate it (conda activate dmo2), clone the repo, and install requirements (pip install -r requirements.txt).
Download the checkpoints (model_1500.pt, model_85000.pt) from Hugging Face to a ckpts folder. Run demo.ipynb for inference.
Highlighted Details
Maintenance & Community
The project is a collaboration with Newsbreak. TODO items indicate ongoing development for training code and fine-tuning.
Licensing & Compatibility
The repository's license is not explicitly stated in the README. The project is modified from the F5-TTS repo, which may have its own licensing implications.
Limitations & Caveats
The training code is under construction and not yet tested. Vocoder fine-tuning or HiFTNet training is pending for higher acoustic quality. Streaming/concatenating inference is also a future TODO.