F5-TTS by SWivid

Speech model for fluent, faithful speech with flow matching

Created 11 months ago
13,242 stars

Top 3.7% on SourcePulse

Project Summary

F5-TTS is an open-source toolkit for text-to-speech (TTS) synthesis, offering advanced models like F5-TTS (Diffusion Transformer with ConvNeXt V2) and E2-TTS (Flat-UNet Transformer). It targets researchers and developers seeking high-fidelity, fluent speech generation with improved training and inference speeds, leveraging flow matching techniques.

How It Works

F5-TTS utilizes a Diffusion Transformer architecture with ConvNeXt V2 for its primary model, aiming for faster training and inference compared to traditional methods. The E2-TTS model offers a closer reproduction of the original paper's Flat-UNet Transformer. A key innovation is the "Sway Sampling" strategy, which enhances inference performance by optimizing flow step sampling.
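To make the flow-matching inference loop concrete, here is a minimal, illustrative sketch: Euler integration of a flow ODE with a non-uniform timestep schedule in the spirit of Sway Sampling. The warp formula and all names here are our illustration of the idea, not code from the F5-TTS repository; the exact schedule is defined in the paper.

```python
import math

def sway_schedule(n_steps, s=-1.0):
    """Illustrative non-uniform flow-step schedule (hypothetical sketch).

    Warps uniform times u in [0, 1] so that ODE steps concentrate
    where they matter most. The warp u + s*(cos(pi/2*u) - 1 + u)
    follows the shape described for Sway Sampling; treat the exact
    formula as an assumption.
    """
    ts = []
    for i in range(n_steps + 1):
        u = i / n_steps
        ts.append(u + s * (math.cos(math.pi / 2 * u) - 1 + u))
    return ts

def euler_flow_sample(x0, velocity, n_steps=32, s=-1.0):
    """Euler integration of a flow-matching ODE dx/dt = v(x, t)."""
    ts = sway_schedule(n_steps, s)
    x = x0
    for t0, t1 in zip(ts, ts[1:]):
        x = x + (t1 - t0) * velocity(x, t0)  # one Euler step
    return x
```

Note that the schedule still runs from t=0 to t=1; only the spacing of intermediate steps changes, which is why it can improve sample quality at a fixed step budget.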

Quick Start & Requirements

  • Installation: Install via pip (pip install f5-tts) for inference, or clone the repository and install it in editable mode (pip install -e .) for training. Docker images are also available.
  • Prerequisites: Python 3.10+, PyTorch with CUDA 12.4+ (NVIDIA), ROCm 6.2 (AMD, Linux), XPU (Intel), or standard PyTorch (Apple Silicon).
  • Resources: Requires NVIDIA GPU for optimal performance. Docker deployment examples are provided.
  • Links: Hugging Face, Model Scope, Wisemodel, Training & Finetuning Guidance.
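The prerequisites above can be sanity-checked before installing. The snippet below is a hypothetical helper (check_environment is our name, not part of the f5-tts package): it verifies the Python version and only probes for PyTorch if it happens to be installed, so it runs anywhere.

```python
import importlib.util
import sys

def check_environment():
    """Check the listed prerequisites: Python 3.10+ and, optionally,
    a PyTorch install with GPU support. Hypothetical helper."""
    report = {"python_ok": sys.version_info >= (3, 10)}
    # PyTorch is only probed, never required, so this never raises ImportError.
    if importlib.util.find_spec("torch") is not None:
        import torch
        report["torch"] = torch.__version__
        report["cuda"] = torch.cuda.is_available()
    else:
        report["torch"] = None
    return report

print(check_environment())
```

On a machine without PyTorch this prints a report with "torch": None; with a CUDA build of PyTorch installed, "cuda" reflects GPU availability.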

Highlighted Details

  • Achieves low latency (253 ms) and a low real-time factor (RTF 0.0394, i.e. well below real time) on an L20 GPU with a client-server setup.
  • Supports offline inference via Triton and TensorRT-LLM.
  • Offers a Gradio web interface for basic TTS, multi-style/speaker generation, and voice chat.
  • Includes CLI inference with options for reference audio/text and custom configurations.
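The RTF figure above is easier to interpret with the definition spelled out. This is a generic sketch (the function name is ours, not the project's): RTF is synthesis time divided by the duration of the audio produced, so values below 1.0 mean faster-than-real-time generation.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time spent synthesizing / duration of audio produced.
    Values below 1.0 mean faster-than-real-time synthesis."""
    return synthesis_seconds / audio_seconds

# At the reported RTF of 0.0394, generating 10 s of audio takes
# roughly 0.0394 * 10 = 0.394 s of compute.
```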

Maintenance & Community

The project acknowledges contributions from various researchers and libraries. It provides links to Hugging Face, ModelScope, and Wisemodel for pre-trained models.

Licensing & Compatibility

The codebase is released under the MIT License. However, pre-trained models are licensed under CC-BY-NC due to the use of the Emilia dataset, restricting commercial use.

Limitations & Caveats

Pre-trained models are restricted to non-commercial use due to dataset licensing. The project is actively developed, with recent updates in March 2025.

Health Check

  • Last Commit: 5 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 1
  • Issues (30d): 20
  • Star History: 295 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Pietro Schirano (founder of MagicPath), and 2 more.

metavoice-src by metavoiceio

  • 0.1% · 4k stars
  • TTS model for human-like, expressive speech
  • Created 1 year ago, updated 1 year ago