f5-tts-mlx  by lucasnewman

Text-to-speech implementation using MLX framework

created 9 months ago
567 stars

Top 57.6% on sourcepulse

Project Summary

This repository provides an MLX implementation of F5-TTS, a non-autoregressive, zero-shot text-to-speech system built on a flow-matching mel spectrogram generator with a diffusion transformer (DiT). It targets users who want fast, high-quality speech synthesis with voice cloning on Apple Silicon.

How It Works

F5-TTS generates mel spectrograms with flow matching, using a Diffusion Transformer (DiT) as the generating network. It builds on the E2 TTS architecture and adds ConvNeXT v2 blocks to improve the learned text alignment, with the aim of better performance and fidelity. Its zero-shot capability enables voice cloning from a reference audio sample, with no per-voice training.
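To make the flow-matching idea concrete, here is a minimal toy sketch in plain Python. It is not the repository's code: `toy_velocity` is a hypothetical stand-in for the trained DiT, and it uses the closed-form velocity of a straight-line path toward a known target, which a real model has to learn from data. The sampler integrates that velocity field from noise (t = 0) to data (t = 1) with fixed Euler steps; the step count of 8 is an arbitrary choice for illustration.

```python
import random


def toy_velocity(x, t, target):
    """Hypothetical stand-in for the trained network (an assumption).

    For a straight-line path from noise to `target`, the velocity that
    carries the current point x to the target at t = 1 is
    (target - x) / (1 - t).
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def euler_sample(target, steps=8, seed=0):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from Gaussian noise
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # always below 1, so the division in toy_velocity is safe
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy "mel frame" with three bins; the values are arbitrary.
target_frame = [0.5, -1.0, 2.0]
sample = euler_sample(target_frame)
```

Because the toy field points straight at the target, the final Euler step lands on it. A trained model would instead steer the noise toward a spectrogram consistent with the input text and the reference audio.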

Quick Start & Requirements

  • Install via pip: pip install f5-tts-mlx
  • Requires macOS with Apple Silicon (MLX framework dependency).
  • Basic usage: python -m f5_tts_mlx.generate --text "..."
  • Voice matching requires a mono, 24 kHz WAV file of 5-10 seconds as reference audio.
  • Quantized models (4-bit, 8-bit) are available via the --q flag.
  • Pretrained model weights are available on Hugging Face.
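
The reference-audio requirement listed above (mono, 24 kHz, 5-10 seconds) can be checked before running generation. The sketch below uses only Python's standard `wave` module. `check_reference_wav` is a hypothetical helper, not part of f5-tts-mlx, and treating the 5-10 second range as a hard limit is an assumption made here for illustration.

```python
import wave


def check_reference_wav(path, rate=24000, min_s=5.0, max_s=10.0):
    """Return a list of problems found in a reference WAV (empty if none)."""
    problems = []
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        framerate = wav.getframerate()
        duration = wav.getnframes() / framerate  # length in seconds
    if channels != 1:
        problems.append(f"expected mono, found {channels} channels")
    if framerate != rate:
        problems.append(f"expected {rate} Hz, found {framerate} Hz")
    if not min_s <= duration <= max_s:
        problems.append(f"expected {min_s}-{max_s} s, found {duration:.1f} s")
    return problems
```

An empty list means the file matches the stated format; otherwise the clip should be converted or trimmed before it is used for voice matching.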

Highlighted Details

  • Generates speech in approximately 4 seconds on an M3 Max MacBook Pro.
  • Supports zero-shot voice cloning with reference audio.
  • Offers quantized models for reduced memory and bandwidth usage.
  • Can be piped with other MLX models, e.g., language models.

Maintenance & Community

This project is based on original implementations by Yushen Chen (F5 TTS) and Phil Wang (E2 TTS). Further community or maintenance details are not specified in the README.

Licensing & Compatibility

  • Released under the MIT license.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

Because it depends on the MLX framework, this implementation runs only on macOS with Apple Silicon. The project implements existing research, and the README does not describe its long-term maintenance status.

Health Check

  • Last commit: 4 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0

Star History

  • 41 stars in the last 90 days

Explore Similar Projects

Starred by Tim J. Baek (Founder of Open WebUI), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 3 more.

StyleTTS2 by yl4579

0.2%
6k
Text-to-speech model achieving human-level synthesis
created 2 years ago
updated 11 months ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Pietro Schirano (Founder of MagicPath), and 1 more.

metavoice-src by metavoiceio

0%
4k
TTS model for human-like, expressive speech
created 1 year ago
updated 1 year ago
Starred by Georgios Konstantopoulos (CTO, General Partner at Paradigm) and Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems).

GPT-SoVITS by RVC-Boss

0.6%
49k
Few-shot voice cloning and TTS web UI
created 1 year ago
updated 2 weeks ago