DMOSpeech2 by yl4579

Metric-optimized speech synthesis with RL

Created 1 month ago
280 stars

Top 93.0% on SourcePulse

Project Summary

DMOSpeech 2 addresses the challenge of optimizing all components in diffusion-based text-to-speech (TTS) systems for perceptual metrics, specifically focusing on duration prediction. It extends prior metric optimization work by employing reinforcement learning for duration prediction, targeting researchers and developers in speech synthesis seeking improved quality and efficiency.

How It Works

This system introduces a novel duration policy framework using Group Relative Preference Optimization (GRPO) to optimize the duration predictor. It leverages speaker similarity and word error rate as reward signals, aiming for a more complete metric-optimized synthesis pipeline. Additionally, it incorporates teacher-guided sampling, a hybrid approach that uses a teacher model for initial denoising steps before transitioning to a student model, enhancing output diversity while maintaining efficiency.
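The teacher-guided sampling idea can be sketched as a simple hybrid loop: the teacher model handles the first few denoising steps, then the distilled student takes over. This is a minimal illustration, not the repository's implementation; the function names, step counts, and switch point are placeholders.

```python
def teacher_guided_sample(x, teacher_step, student_step, total_steps, switch_at):
    """Hybrid sampling: run the teacher for the first `switch_at` denoising
    steps, then hand off to the distilled student for the remainder.
    All callables and step counts here are illustrative placeholders."""
    for t in range(total_steps):
        step = teacher_step if t < switch_at else student_step
        x = step(x, t)
    return x

# Toy "denoisers" that just record which model handled each step.
trace = []
teacher = lambda x, t: trace.append(("teacher", t)) or x
student = lambda x, t: trace.append(("student", t)) or x
out = teacher_guided_sample(0.0, teacher, student, total_steps=8, switch_at=3)
```

The early teacher steps inject diversity into the trajectory, while the student finishes cheaply, which is how the hybrid keeps efficiency.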

Quick Start & Requirements

  • Install: Create a Python 3.10 environment (conda create -n dmo2 python=3.10), activate it (conda activate dmo2), clone the repo, and install requirements (pip install -r requirements.txt).
  • Prerequisites: Python 3.10, Conda.
  • Inference: Download checkpoints (model_1500.pt, model_85000.pt) from Hugging Face into a ckpts folder, then run demo.ipynb.
  • Links: DMOSpeech 2 GitHub
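The install steps above, collected into one shell session. The environment name, Python version, and requirements file come from the README; the repository URL is an assumption based on the author and project name.

```shell
# Create and activate the Python 3.10 environment (names from the README)
conda create -n dmo2 python=3.10
conda activate dmo2

# Clone the repository; URL assumed from the author/project name
git clone https://github.com/yl4579/DMOSpeech2.git
cd DMOSpeech2

# Install Python dependencies
pip install -r requirements.txt
```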

Highlighted Details

  • Reinforcement learning for duration prediction using GRPO.
  • Reward signals include speaker similarity and word error rate.
  • Teacher-guided sampling for improved diversity and efficiency.
  • Claims superior performance across metrics while halving the number of sampling steps.
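The GRPO-style training signal described above can be illustrated with a toy example: compute a reward per sampled duration from speaker similarity and word error rate, then normalize rewards within the group. The reward weighting and all numbers here are hypothetical; DMOSpeech 2's actual reward formulation may differ.

```python
import statistics

def reward(speaker_sim: float, wer: float) -> float:
    """Toy reward: reward high speaker similarity, penalize word error rate.
    The actual weighting used by DMOSpeech 2 is not specified here."""
    return speaker_sim - wer

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: normalize each reward against its own group,
    so no learned value function (critic) is needed."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

# A group of duration samples for the same text, with toy metric scores
# given as (speaker_similarity, word_error_rate) pairs.
samples = [(0.92, 0.05), (0.88, 0.12), (0.95, 0.03), (0.80, 0.20)]
rewards = [reward(sim, wer) for sim, wer in samples]
advantages = group_relative_advantages(rewards)
```

Samples that beat their group's average get positive advantages and are reinforced; the group-relative normalization is what distinguishes GRPO from standard policy-gradient setups with a critic.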

Maintenance & Community

The project is a collaboration with Newsbreak. TODO items indicate ongoing development for training code and fine-tuning.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. The project is modified from the F5-TTS repo, which may have its own licensing implications.

Limitations & Caveats

The training code is under construction and not yet tested. Vocoder fine-tuning or HiFTNet training is pending for higher acoustic quality. Streaming/concatenating inference is also a future TODO.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 14 stars in the last 30 days

Explore Similar Projects

Starred by Tim J. Baek (Founder of Open WebUI), Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems"), and 4 more.

StyleTTS2 by yl4579

  • Top 0.2% on SourcePulse, 6k stars
  • Text-to-speech model achieving human-level synthesis
  • Created 2 years ago, updated 1 year ago