Speech model for fluent, faithful speech with flow matching
F5-TTS is an open-source toolkit for text-to-speech (TTS) synthesis built on flow matching. It provides two models: F5-TTS (a Diffusion Transformer with ConvNeXt V2) and E2-TTS (a Flat-UNet Transformer). It targets researchers and developers who want high-fidelity, fluent speech generation with faster training and inference.
How It Works
The primary model, F5-TTS, uses a Diffusion Transformer architecture with ConvNeXt V2 and aims for faster training and inference than traditional methods. The E2-TTS model is a Flat-UNet Transformer intended as a close reproduction of the original paper. A key innovation is the "Sway Sampling" strategy, applied at inference time, which improves performance by changing how the flow steps are sampled rather than spacing them uniformly.
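The idea behind Sway Sampling can be sketched in a few lines. This is a minimal illustration, not the toolkit's API: it assumes the sway function has the form t = u + s * (cos(pi/2 * u) - 1 + u), where u is a uniformly spaced flow step in [0, 1] and s is a sway coefficient. The function names `sway` and `sway_schedule` and the default s = -1 are assumptions made for this example.

```python
import math


def sway(u: float, s: float = -1.0) -> float:
    """Map a uniform flow step u in [0, 1] to a swayed step (assumed form).

    With s < 0 the steps are pushed toward t = 0 (the early part of the
    flow); with s = 0 the schedule stays uniform.
    """
    return u + s * (math.cos(math.pi / 2 * u) - 1 + u)


def sway_schedule(n_steps: int, s: float = -1.0) -> list[float]:
    """Return n_steps + 1 time points from 0 to 1 after applying sway."""
    return [sway(i / n_steps, s) for i in range(n_steps + 1)]


# Hypothetical usage: an 8-step schedule for an ODE solver to integrate over.
schedule = sway_schedule(8)
uniform = sway_schedule(8, s=0.0)
```

Under this assumed form, a negative coefficient keeps the endpoints at 0 and 1 but places more of the time points near the start of the flow, so an ODE solver takes smaller steps there.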
Quick Start & Requirements
Install with pip (pip install f5-tts) for inference, or clone the repository and install in editable mode (pip install -e .) for training. Docker images are also available.
Highlighted Details
Maintenance & Community
The project acknowledges contributions from various researchers and libraries. It provides links to Hugging Face, ModelScope, and Wisemodel for pre-trained models.
Licensing & Compatibility
The codebase is released under the MIT License. However, pre-trained models are licensed under CC-BY-NC due to the use of the Emilia dataset, restricting commercial use.
Limitations & Caveats
Pre-trained models are restricted to non-commercial use due to dataset licensing. The project is actively developed, with recent updates in March 2025.