TTS research paper using rectified flow matching
VoiceFlow is an efficient text-to-speech system that leverages rectified flow matching to achieve high-quality speech synthesis. It is designed for researchers and practitioners in speech processing who are looking for advanced TTS models with a focus on speed-quality trade-offs. The project provides an official implementation of the ICASSP 2024 paper "VoiceFlow: Efficient Text-to-Speech with Rectified Flow Matching."
How It Works
VoiceFlow models speech generation with flow matching, specifically rectified flow. A neural network is trained to estimate a vector field that transports samples from a simple prior distribution (e.g., Gaussian noise) to the target data distribution (e.g., mel-spectrograms), conditioned on the text input. Generation then amounts to solving the ordinary differential equation defined by that vector field. The "rectified" step straightens the sampling trajectories, so the equation can be solved accurately in only a few steps, which is where the efficiency gain comes from. Compared with diffusion models and GANs, this approach can potentially offer faster sampling and better control over the generation process.
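The idea can be illustrated with a minimal, dependency-free sketch. It assumes the common straight-line formulation of flow matching (interpolate linearly between a noise sample and a data sample, and regress the constant velocity between them), which may differ in detail from the parameterization used in the VoiceFlow code. All names here (make_training_pair, flow_matching_loss, euler_sample, oracle_velocity, TARGET) are hypothetical and are not part of the repository.

```python
import random

def make_training_pair(x1, rng):
    """Build one training example for a target vector x1 (list of floats)."""
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]                  # sample from the Gaussian prior
    t = rng.random()                                        # time drawn uniformly from [0, 1)
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]   # point on the straight path x0 -> x1
    target_v = [b - a for a, b in zip(x0, x1)]              # constant velocity along that path
    return x_t, t, target_v

def flow_matching_loss(predict_v, x_t, t, target_v):
    """Mean squared error between predicted and target velocity."""
    pred = predict_v(x_t, t)
    return sum((p - v) ** 2 for p, v in zip(pred, target_v)) / len(target_v)

def euler_sample(predict_v, x0, num_steps):
    """Integrate dx/dt = predict_v(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        v = predict_v(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy stand-in for a trained network (assumption for illustration only):
# when the "dataset" is the single point TARGET, the exact velocity field
# is (TARGET - x) / (1 - t).
TARGET = [2.0, -1.0]

def oracle_velocity(x, t):
    return [(m - xi) / (1.0 - t) for m, xi in zip(TARGET, x)]
```

In this toy setting the trajectories are already straight, so euler_sample(oracle_velocity, noise, num_steps) lands on TARGET even with very few steps. In a real model the learned trajectories are curved, and straightening them is what lets a small step count remain accurate.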
Quick Start & Requirements
Environment setup: conda create -n vflow python==3.9, conda activate vflow, pip install -r requirements.txt, source path.sh. Also requires monotonic_align installation (cd model/monotonic_align; python setup.py build_ext --inplace).
Data preparation: bash extract_fbank.sh. Requires 16kHz audio data.
Configuration files: configs/. Training command: python train.py -c configs/${your_yaml} -m ${model_name}
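Because the pipeline expects 16kHz audio, it can be useful to check files before feature extraction. The following is a small sketch using only Python's standard wave module. The helper names (wav_sample_rate, is_16khz) are hypothetical and not part of the repository, and the check only covers uncompressed PCM WAV files.

```python
import wave

EXPECTED_RATE = 16000  # 16kHz, as required by the data preparation step

def wav_sample_rate(path):
    """Return the sample rate (Hz) stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()

def is_16khz(path):
    """True if the WAV file at `path` is sampled at 16kHz."""
    return wav_sample_rate(path) == EXPECTED_RATE
```

Files that fail the check would need to be resampled with an audio tool of your choice before running the extraction script.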
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats