VoiceFlow-TTS by X-LANCE

TTS research paper using rectified flow matching

Created 2 years ago

365 stars

Top 77.1% on SourcePulse

Project Summary

VoiceFlow is an efficient text-to-speech system that leverages rectified flow matching to achieve high-quality speech synthesis. It is designed for researchers and practitioners in speech processing who are looking for advanced TTS models with a focus on speed-quality trade-offs. The project provides an official implementation of the ICASSP 2024 paper "VoiceFlow: Efficient Text-to-Speech with Rectified Flow Matching."

How It Works

VoiceFlow models speech generation with flow matching, specifically rectified flow. A neural network is trained to estimate a vector field that transports samples from a simple prior distribution (e.g., Gaussian noise) to the target data distribution (mel-spectrograms), conditioned on the input text. Synthesis then amounts to solving the resulting ordinary differential equation. The "rectified" part refers to straightening the sampling trajectories, so that the ODE can be solved with fewer steps at similar quality. This offers an alternative to diffusion models and GANs, potentially providing faster sampling and better control over the generation process.
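The idea above can be illustrated with a minimal NumPy sketch of the commonly used straight-line formulation of rectified flow. It is not the repository's code: the function names, the linear interpolation path, and the fixed-step Euler solver are assumptions chosen for illustration, and text conditioning is omitted. The last function shows the data-generation half of a reflow round, where noise is paired with the sample the current model produces from it.

```python
import numpy as np

def rectified_flow_pair(x0, x1, t):
    """Point on the straight path from x0 (noise) to x1 (data), plus its velocity target."""
    # Broadcast one time value per batch item over the remaining dimensions.
    t = np.asarray(t, dtype=float).reshape(-1, *([1] * (x1.ndim - 1)))
    x_t = (1.0 - t) * x0 + t * x1
    target = x1 - x0  # constant velocity along a straight line
    return x_t, target

def flow_matching_loss(predict_velocity, x0, x1, t):
    """Mean squared error between the predicted and the straight-line velocity."""
    x_t, target = rectified_flow_pair(x0, x1, t)
    return float(np.mean((predict_velocity(x_t, t) - target) ** 2))

def euler_sample(predict_velocity, x0, num_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = np.full(x.shape[0], step * dt)
        x = x + dt * predict_velocity(x, t)
    return x

def reflow_pairs(predict_velocity, x0, num_steps):
    """Pair each noise sample with the output the current model generates from it."""
    return x0, euler_sample(predict_velocity, x0, num_steps)
```

Here predict_velocity stands in for the trained network (a hypothetical callable taking the current sample and a time per batch item). Retraining on the pairs returned by reflow_pairs is what tends to straighten the trajectories, which is why fewer Euler steps can suffice afterwards.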

Quick Start & Requirements

Installation: Requires Python 3.9 and Linux. Set up the environment with Conda, then build the monotonic_align extension:
    conda create -n vflow python==3.9
    conda activate vflow
    pip install -r requirements.txt
    source path.sh
    cd model/monotonic_align; python setup.py build_ext --inplace
Prerequisites: Kaldi-style data organization is expected. Data preparation involves extracting mel-spectrograms using bash extract_fbank.sh. Requires 16kHz audio data.
Training: Configured via YAML files in configs/. Training command: python train.py -c configs/${your_yaml} -m ${model_name}.
Resources: Training and data preparation can be resource-intensive. The README mentions using multiple GPUs to decode in parallel during the ReFlow process.
Documentation: Official implementation of the ICASSP 2024 paper.
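For readers unfamiliar with Kaldi-style data organization, the sketch below reads and writes the plain-text tables that Kaldi conventionally uses, such as wav.scp (utterance ID to audio path) and utt2spk (utterance ID to speaker ID). Each line holds a key followed by whitespace and a value, sorted by key. The helper names are hypothetical, and exactly which tables this repository expects is not stated here, so treat this as a sketch of the general convention only.

```python
from pathlib import Path

def write_kaldi_table(path, table):
    """Write a {key: value} mapping as 'key value' lines, sorted by key."""
    lines = [f"{key} {value}" for key, value in sorted(table.items())]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")

def read_kaldi_table(path):
    """Read 'key value' lines back into a dict; the value may contain spaces."""
    table = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue
        key, value = line.split(maxsplit=1)
        table[key] = value
    return table
```

A custom dataset could be converted by building a dict of utterance IDs to audio paths and writing it with write_kaldi_table before running the feature extraction script.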

Highlighted Details

Implements "Rectified Flow Matching" for efficient TTS.
Supports training with ground truth durations or using the Monotonic Alignment Search (MAS) algorithm.
Includes a "ReFlow" process for further model improvement by generating data with the trained model and retraining.
Offers experimental features like voice conversion and likelihood estimation.
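As background for the Monotonic Alignment Search option listed above, here is a pure-NumPy sketch of the dynamic program as it is usually described in the Glow-TTS line of work. The repository builds a compiled monotonic_align extension instead, so this is an illustration under assumptions rather than its implementation. The input is assumed to be a matrix of log-likelihoods with one row per text token and one column per mel frame.

```python
import numpy as np

def monotonic_alignment_search(log_probs):
    """Find the monotonic, non-skipping token-to-frame alignment with the highest total score.

    log_probs has shape (n_tokens, n_frames) and finite entries.
    Returns (alignment, durations): the token index for each frame and
    the number of frames assigned to each token.
    """
    n_tokens, n_frames = log_probs.shape
    if n_tokens > n_frames:
        raise ValueError("need at least one frame per token")

    # q[i, j] is the best score of any path that ends with token i at frame j.
    q = np.full((n_tokens, n_frames), -np.inf)
    q[0, 0] = log_probs[0, 0]
    for j in range(1, n_frames):
        for i in range(min(j + 1, n_tokens)):
            stay = q[i, j - 1]
            advance = q[i - 1, j - 1] if i > 0 else -np.inf
            q[i, j] = log_probs[i, j] + max(stay, advance)

    # Backtrack from the last token at the last frame.
    alignment = np.zeros(n_frames, dtype=int)
    i = n_tokens - 1
    for j in range(n_frames - 1, -1, -1):
        alignment[j] = i
        if j > 0 and i > 0 and q[i - 1, j - 1] >= q[i, j - 1]:
            i -= 1
    durations = np.bincount(alignment, minlength=n_tokens)
    return alignment, durations
```

The durations returned here play the same role as ground-truth durations when those are not available: they tell the model how many frames each token should span.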

Maintenance & Community

The project is associated with the ICASSP 2024 paper.
References Kaldi, UniCATS-CTX-vec2wav, GradTTS, VITS, and CFM for utility scripts and architectural inspiration.
No specific community links (Discord/Slack) or roadmap are mentioned in the README.

Licensing & Compatibility

The README does not state a license, so the terms for commercial use or closed-source linking are unspecified. The project appears to be intended for research use.

Limitations & Caveats

Experimental functionalities are marked with a warning and are not guaranteed to be correct.
The project relies heavily on Kaldi-style data preparation, which may require significant effort for users with custom datasets.
The README notes that optimal transport, one of the experimental features, "does not work very well for now."

Health Check

Last Commit: 1 year ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 0

Star History

1 star in the last 30 days

Explore Similar Projects

speech-recognition-uk by egorsmkv

Resource collection for Ukrainian speech AI

Created 5 years ago

Updated 4 months ago

Meta-voicebox by SpeechifyInc

PyTorch implementation of Meta's Voicebox speech model

Created 2 years ago

Updated 2 years ago

Starred by Piotr Dąbkowski (Cofounder of ElevenLabs).

assem-vc by maum-ai

PyTorch code for any-to-many voice conversion research

Created 4 years ago

Updated 3 years ago

GenerSpeech by Rongjiehuang

Text-to-speech model for zero-shot style transfer of custom voice

Created 3 years ago

Updated 1 year ago

voicebox-pytorch by lucidrains

PyTorch implementation of MetaAI's Voicebox text-to-speech model

Created 2 years ago

Updated 1 year ago

Starred by Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral).

FastDiff by Rongjiehuang

PyTorch implementation for fast, high-fidelity speech synthesis via conditional diffusion

Created 4 years ago

Updated 1 year ago

Starred by Georgios Konstantopoulos (CTO, General Partner at Paradigm).

f5-tts-mlx by lucasnewman

Text-to-speech implementation using MLX framework

Created 1 year ago

Updated 9 months ago

Starred by Benjamin Bolte (Cofounder of K-Scale Labs).

vits2 by daniilrobnikov

Unofficial VITS2 implementation for single-stage text-to-speech research

Created 2 years ago

Updated 2 years ago

FireRedTTS by FireRedTeam

LLM-empowered TTS system for research

Created 1 year ago

Updated 3 months ago

Starred by Benjamin Bolte (Cofounder of K-Scale Labs).

speech-synthesis-paper by wenet-e2e

Speech synthesis papers list

Created 5 years ago

Updated 2 years ago

TransformerTTS by spring-media

TensorFlow 2 implementation for non-autoregressive text-to-speech

Created 5 years ago

Updated 1 year ago

Starred by Georgios Konstantopoulos (CTO, General Partner at Paradigm) and Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems").

GPT-SoVITS by RVC-Boss

Few-shot voice cloning and TTS web UI

Created 2 years ago

Updated 1 week ago
