F5-TTS by SWivid

Speech model for fluent, faithful speech with flow matching

created 9 months ago
12,805 stars

Top 3.9% on sourcepulse

View on GitHub
Project Summary

F5-TTS is an open-source toolkit for text-to-speech (TTS) synthesis, offering advanced models like F5-TTS (Diffusion Transformer with ConvNeXt V2) and E2-TTS (Flat-UNet Transformer). It targets researchers and developers seeking high-fidelity, fluent speech generation with improved training and inference speeds, leveraging flow matching techniques.

How It Works

F5-TTS pairs a Diffusion Transformer backbone with ConvNeXt V2 for its primary model, targeting faster training and inference than earlier diffusion-based TTS approaches. The E2-TTS model is a closer reproduction of the original paper's flat-UNet Transformer. A key addition is the "Sway Sampling" strategy, which improves inference by biasing how flow-matching timesteps are sampled rather than drawing them uniformly.
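
As a rough sketch of the idea (paraphrasing the paper; treat the exact formulation here as approximate): instead of drawing the flow step t uniformly at inference time, a uniform sample u in [0, 1] is mapped through

    t = u + s · (cos(πu/2) − 1 + u)

where s is the sway coefficient. Negative values of s (the reference implementation is reported to default to s = −1) concentrate the sampled steps early in the flow trajectory, which the authors report improves output quality for a given number of flow steps.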

Quick Start & Requirements

  • Installation: Install via pip (pip install f5-tts) for inference, or clone the repository and install it in editable mode (pip install -e .) for training. Docker images are also available. A minimal install-and-launch sketch follows this list.
  • Prerequisites: Python 3.10+, PyTorch with CUDA 12.4+ (NVIDIA), ROCm 6.2 (AMD, Linux), XPU (Intel), or standard PyTorch (Apple Silicon).
  • Resources: Requires NVIDIA GPU for optimal performance. Docker deployment examples are provided.
  • Links: Hugging Face, ModelScope, Wisemodel, Training & Finetuning Guidance.
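
The following is a minimal install-and-launch sketch, assuming a Linux machine with a recent NVIDIA driver; the console entry points (f5-tts_infer-gradio, f5-tts_infer-cli) follow the upstream README, and their names or options may change between releases.

    # Inference-only install from PyPI
    pip install f5-tts

    # Or: editable install from source (needed for training/finetuning)
    git clone https://github.com/SWivid/F5-TTS.git
    cd F5-TTS
    pip install -e .

    # Quick smoke test: launch the Gradio web interface locally
    f5-tts_infer-gradio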

Highlighted Details

  • Reports low latency (253 ms) and a real-time factor (RTF) of 0.0394, well under real time, on an NVIDIA L20 GPU in a client-server setup.
  • Supports offline inference via Triton and TensorRT-LLM.
  • Offers a Gradio web interface for basic TTS, multi-style/speaker generation, and voice chat.
  • Includes CLI inference with options for reference audio/text and custom configurations (see the example after this list).
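
For the CLI path, a hedged example of the kind of invocation described above; the flag names and the F5TTS_v1_Base model identifier are taken from the upstream README at the time of writing, so verify them against f5-tts_infer-cli --help before relying on them.

    # One-shot CLI inference with a reference clip and its transcript
    f5-tts_infer-cli \
        --model F5TTS_v1_Base \
        --ref_audio "ref_audio.wav" \
        --ref_text "Transcript of the reference audio." \
        --gen_text "Text for the cloned voice to speak."

    # Or drive the same run from a custom TOML configuration file
    f5-tts_infer-cli -c custom.toml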

Maintenance & Community

The project acknowledges contributions from various researchers and libraries. It provides links to Hugging Face, ModelScope, and Wisemodel for pre-trained models.

Licensing & Compatibility

The codebase is released under the MIT License. However, pre-trained models are licensed under CC-BY-NC due to the use of the Emilia dataset, restricting commercial use.

Limitations & Caveats

Pre-trained models are restricted to non-commercial use due to dataset licensing. The project is actively developed, with recent updates in March 2025.

Health Check

  • Last commit: 1 week ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 6
  • Issues (30d): 40

Star History

  • 1,243 stars in the last 90 days

Explore Similar Projects

Starred by Georgios Konstantopoulos (CTO, General Partner at Paradigm) and Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems).

GPT-SoVITS by RVC-Boss
  • Few-shot voice cloning and TTS web UI
  • Top 0.6% on sourcepulse, 49k stars
  • Created 1 year ago, updated 2 weeks ago

Starred by Michael Han (Cofounder of Unsloth), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 7 more.

TTS by coqui-ai
  • Deep learning toolkit for Text-to-Speech, research-tested
  • Top 0.4% on sourcepulse, 42k stars
  • Created 5 years ago, updated 11 months ago