F5-TTS by SWivid

Speech model for fluent, faithful speech with flow matching

Created 11 months ago
13,242 stars

Top 3.7% on SourcePulse

Project Summary

F5-TTS is an open-source toolkit for text-to-speech (TTS) synthesis, offering advanced models like F5-TTS (Diffusion Transformer with ConvNeXt V2) and E2-TTS (Flat-UNet Transformer). It targets researchers and developers seeking high-fidelity, fluent speech generation with improved training and inference speeds, leveraging flow matching techniques.

How It Works

F5-TTS utilizes a Diffusion Transformer architecture with ConvNeXt V2 for its primary model, aiming for faster training and inference compared to traditional methods. The E2-TTS model offers a closer reproduction of the original paper's Flat-UNet Transformer. A key innovation is the "Sway Sampling" strategy, which enhances inference performance by optimizing flow step sampling.
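To make the flow-matching inference loop concrete, here is a minimal, illustrative sketch: Euler integration of a flow ODE with a non-uniform timestep schedule in the spirit of Sway Sampling. The warp formula and all names here are our illustration of the idea, not code from the F5-TTS repository; the exact schedule is defined in the paper.

```python
import math

def sway_schedule(n_steps, s=-1.0):
    """Illustrative non-uniform flow-step schedule (hypothetical sketch).

    Warps uniform times u in [0, 1] so that ODE steps concentrate
    where they matter most. The warp u + s*(cos(pi/2*u) - 1 + u)
    follows the shape described for Sway Sampling; treat the exact
    formula as an assumption.
    """
    ts = []
    for i in range(n_steps + 1):
        u = i / n_steps
        ts.append(u + s * (math.cos(math.pi / 2 * u) - 1 + u))
    return ts

def euler_flow_sample(x0, velocity, n_steps=32, s=-1.0):
    """Euler integration of a flow-matching ODE dx/dt = v(x, t)."""
    ts = sway_schedule(n_steps, s)
    x = x0
    for t0, t1 in zip(ts, ts[1:]):
        x = x + (t1 - t0) * velocity(x, t0)  # one Euler step
    return x
```

Note that the schedule still runs from t=0 to t=1; only the spacing of intermediate steps changes, which is why it can improve sample quality at a fixed step budget.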

Quick Start & Requirements

  • Installation: Install via pip (pip install f5-tts) for inference, or clone the repository and install it in editable mode (pip install -e .) for training. Docker images are also available.
  • Prerequisites: Python 3.10+, PyTorch with CUDA 12.4+ (NVIDIA), ROCm 6.2 (AMD, Linux), XPU (Intel), or standard PyTorch (Apple Silicon).
  • Resources: Requires NVIDIA GPU for optimal performance. Docker deployment examples are provided.
  • Links: Hugging Face, Model Scope, Wisemodel, Training & Finetuning Guidance.
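The prerequisites above can be sanity-checked before installing. The snippet below is a hypothetical helper (check_environment is our name, not part of the f5-tts package): it verifies the Python version and only probes for PyTorch if it happens to be installed, so it runs anywhere.

```python
import importlib.util
import sys

def check_environment():
    """Check the listed prerequisites: Python 3.10+ and, optionally,
    a PyTorch install with GPU support. Hypothetical helper."""
    report = {"python_ok": sys.version_info >= (3, 10)}
    # PyTorch is only probed, never required, so this never raises ImportError.
    if importlib.util.find_spec("torch") is not None:
        import torch
        report["torch"] = torch.__version__
        report["cuda"] = torch.cuda.is_available()
    else:
        report["torch"] = None
    return report

print(check_environment())
```

On a machine without PyTorch this prints a report with "torch": None; with a CUDA build of PyTorch installed, "cuda" reflects GPU availability.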

Highlighted Details

  • Achieves low latency (253 ms) and a low real-time factor (RTF 0.0394, i.e. well below real time) on an L20 GPU with a client-server setup.
  • Supports offline inference via Triton and TensorRT-LLM.
  • Offers a Gradio web interface for basic TTS, multi-style/speaker generation, and voice chat.
  • Includes CLI inference with options for reference audio/text and custom configurations.
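The RTF figure above is easier to interpret with the definition spelled out. This is a generic sketch (the function name is ours, not the project's): RTF is synthesis time divided by the duration of the audio produced, so values below 1.0 mean faster-than-real-time generation.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time spent synthesizing / duration of audio produced.
    Values below 1.0 mean faster-than-real-time synthesis."""
    return synthesis_seconds / audio_seconds

# At the reported RTF of 0.0394, generating 10 s of audio takes
# roughly 0.0394 * 10 = 0.394 s of compute.
```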

Maintenance & Community

The project acknowledges contributions from various researchers and libraries. It provides links to Hugging Face, ModelScope, and Wisemodel for pre-trained models.

Licensing & Compatibility

The codebase is released under the MIT License. However, pre-trained models are licensed under CC-BY-NC due to the use of the Emilia dataset, restricting commercial use.

Limitations & Caveats

Pre-trained models are restricted to non-commercial use due to dataset licensing. The project is actively developed, with recent updates in March 2025.

Health Check

  • Last Commit: 5 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 1
  • Issues (30d): 20
  • Star History: 295 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Pietro Schirano (founder of MagicPath), and 2 more.

metavoice-src by metavoiceio

  • 0.1% · 4k stars
  • TTS model for human-like, expressive speech
  • Created 1 year ago, updated 1 year ago