VoiceFlow-TTS  by X-LANCE

TTS research paper using rectified flow matching

created 1 year ago
352 stars

Top 80.3% on sourcepulse

Project Summary

VoiceFlow is an efficient text-to-speech system that leverages rectified flow matching to achieve high-quality speech synthesis. It is designed for researchers and practitioners in speech processing who are looking for advanced TTS models with a focus on speed-quality trade-offs. The project provides an official implementation of the ICASSP 2024 paper "VoiceFlow: Efficient Text-to-Speech with Rectified Flow Matching."

How It Works

VoiceFlow models speech generation with flow matching, specifically rectified flow. A neural network is trained to estimate a vector field that defines an ordinary differential equation (ODE) transporting samples from a simple prior distribution (e.g., Gaussian noise) to the target data distribution (e.g., mel-spectrograms), conditioned on the input text. The "rectified" part refers to straightening these sampling trajectories: the model is retrained on noise-sample pairs generated by the model itself, so that the ODE can be solved accurately in very few steps. This offers an alternative to diffusion models and GANs, with potentially faster sampling and better control over the generation process.
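To make the idea concrete, here is a minimal NumPy sketch of a generic rectified flow objective and an Euler sampler. It is an illustration under stated assumptions, not the repository's code: the function names (rectified_flow_pair, flow_matching_loss, euler_sample) are hypothetical, the path is the plain straight line between noise and data, and text conditioning is omitted. The exact path parameterization VoiceFlow uses may differ.

```python
import numpy as np

def rectified_flow_pair(x0, x1, t):
    """Return the point x_t on the straight line from x0 to x1, and the
    velocity target x1 - x0 (assumed plain linear path, no conditioning)."""
    # Reshape t from (batch,) so it broadcasts over the feature dimensions.
    t = np.asarray(t, dtype=float).reshape(-1, *([1] * (x0.ndim - 1)))
    x_t = (1.0 - t) * x0 + t * x1
    target = x1 - x0
    return x_t, target

def flow_matching_loss(v_fn, x0, x1, t):
    """Mean squared error between the predicted and the target velocity."""
    x_t, target = rectified_flow_pair(x0, x1, t)
    return float(np.mean((v_fn(x_t, t) - target) ** 2))

def euler_sample(v_fn, x0, n_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-size Euler steps."""
    x = x0.copy()
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = np.full(x.shape[0], i * dt)
        x = x + dt * v_fn(x, t)
    return x

# Toy usage: 4 "utterances" with 80-dimensional frames (hypothetical sizes).
rng = np.random.default_rng(0)
noise = rng.standard_normal((4, 80))   # samples from the prior
data = rng.standard_normal((4, 80))    # stand-in for mel-spectrogram frames

# An oracle field with perfectly straight paths; a trained network would
# replace this lambda.
oracle = lambda x, t: data - noise
loss = flow_matching_loss(oracle, noise, data, rng.uniform(size=4))
one_step = euler_sample(oracle, noise, n_steps=1)
```

With a perfectly straight field, a single Euler step already lands on the data, which is the motivation for straightening trajectories: the straighter the learned paths, the fewer solver steps sampling needs.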

Quick Start & Requirements

  • Installation: Requires Python 3.9 and Linux. Create and activate a Conda environment (conda create -n vflow python==3.9; conda activate vflow), install the dependencies (pip install -r requirements.txt), then run source path.sh. Finally, build the monotonic_align extension (cd model/monotonic_align; python setup.py build_ext --inplace).
  • Prerequisites: Data must follow a Kaldi-style organization. Data preparation extracts mel-spectrograms with bash extract_fbank.sh and requires 16 kHz audio.
  • Training: Configured via YAML files in configs/. Training command: python train.py -c configs/${your_yaml} -m ${model_name}.
  • Resources: Training and data preparation can be resource-intensive. The README mentions using multiple GPUs for parallel decoding during the reflow process.
  • Documentation: Official implementation of the ICASSP 2024 paper.
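For readers unfamiliar with Kaldi-style data organization, the sketch below parses two index files in the usual Kaldi text format, where each line is an utterance ID followed by a value. The file names wav.scp and utt2spk are standard Kaldi conventions, but which files this project actually expects is an assumption here and should be checked against the README. The helper read_kaldi_table and the example paths are hypothetical.

```python
def read_kaldi_table(lines):
    """Parse Kaldi-style 'key value' lines into a dict.

    The key is the first whitespace-delimited token; the rest of the
    line is kept as the value.
    """
    table = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split(maxsplit=1)
        if len(parts) != 2:
            raise ValueError(f"expected 'key value', got: {line!r}")
        key, value = parts
        table[key] = value
    return table

# Hypothetical contents of wav.scp (utterance ID -> audio path)
# and utt2spk (utterance ID -> speaker ID).
wav_scp = read_kaldi_table([
    "utt001 /data/wavs/utt001.wav",
    "utt002 /data/wavs/utt002.wav",
])
utt2spk = read_kaldi_table([
    "utt001 spkA",
    "utt002 spkB",
])

# Sanity check before feature extraction: every utterance with audio
# should also have a speaker label.
missing = sorted(set(wav_scp) - set(utt2spk))
```

A check like this is cheap to run before feature extraction and catches mismatched utterance IDs early. In practice the lines would come from open("wav.scp") rather than an inline list.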

Highlighted Details

  • Implements "Rectified Flow Matching" for efficient TTS.
  • Supports training with ground truth durations or using the Monotonic Alignment Search (MAS) algorithm.
  • Includes a "ReFlow" process for further model improvement by generating data with the trained model and retraining.
  • Offers experimental features like voice conversion and likelihood estimation.
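The Monotonic Alignment Search mentioned above can be sketched as a small dynamic program. The version below follows the general formulation popularized by Glow-TTS: every mel frame is assigned to exactly one text token, token indices never decrease, and no token is skipped. It is a plain NumPy illustration, not the Cython code in model/monotonic_align; the function name and the toy matrix are hypothetical.

```python
import numpy as np

def monotonic_alignment_search(log_lik):
    """Find the monotonic path maximizing the summed log-likelihood.

    log_lik has shape (n_text, n_mel); entry [i, j] scores assigning
    mel frame j to text token i. Returns the token index per frame.
    """
    n_text, n_mel = log_lik.shape
    if n_mel < n_text:
        raise ValueError("need at least one mel frame per text token")

    # q[i, j] = best score of any valid path that ends at token i, frame j.
    q = np.full((n_text, n_mel), -np.inf)
    q[0, 0] = log_lik[0, 0]
    for j in range(1, n_mel):
        for i in range(min(j + 1, n_text)):
            stay = q[i, j - 1]                              # same token as previous frame
            advance = q[i - 1, j - 1] if i > 0 else -np.inf  # moved on from previous token
            q[i, j] = log_lik[i, j] + max(stay, advance)

    # Backtrack from the last token at the last frame.
    path = np.zeros(n_mel, dtype=int)
    i = n_text - 1
    for j in range(n_mel - 1, -1, -1):
        path[j] = i
        # Step back a token if forced (i == j) or if advancing scored higher.
        if i > 0 and (i == j or q[i - 1, j - 1] >= q[i, j - 1]):
            i -= 1
    return path

# Toy example: 3 tokens, 5 frames, with one clearly preferred alignment.
scores = np.full((3, 5), -10.0)
for frame, token in enumerate([0, 0, 1, 2, 2]):
    scores[token, frame] = 0.0

path = monotonic_alignment_search(scores)
durations = np.bincount(path, minlength=scores.shape[0])  # frames per token
```

The per-token frame counts recovered this way are what a duration predictor would be trained on when ground truth durations are not available.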

Maintenance & Community

  • The project is associated with the ICASSP 2024 paper.
  • References Kaldi, UniCATS-CTX-vec2wav, GradTTS, VITS, and CFM for utility scripts and architectural inspiration.
  • No specific community links (Discord/Slack) or roadmap are mentioned in the README.

Licensing & Compatibility

  • The README does not explicitly state a license. However, the project's structure and dependencies suggest it is intended for research purposes. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

  • Experimental functionalities are marked with a warning and are not guaranteed to be correct.
  • The project relies heavily on Kaldi-style data preparation, which may require significant effort for users with custom datasets.
  • The README notes that one experimental feature, Optimal Transport, "does not work very well for now."

Health Check

  • Last commit: 11 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0

Star History

  • 10 stars in the last 90 days
