f5-tts-mlx  by lucasnewman

Text-to-speech implementation using MLX framework

Created 11 months ago
583 stars

Top 55.6% on SourcePulse

Project Summary

This repository provides an MLX implementation of F5-TTS, a non-autoregressive, zero-shot text-to-speech system built on a flow-matching mel spectrogram generator with a diffusion transformer (DiT). It targets users who want fast, high-quality speech synthesis with voice cloning on Apple Silicon.

How It Works

F5-TTS generates mel spectrograms with flow matching, using a Diffusion Transformer (DiT) as the generator's backbone. It builds on the E2 TTS architecture and adds ConvNeXT v2 blocks to improve the learned text alignment. Because the system is zero-shot, it can clone a voice from a reference audio sample.
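To make the flow-matching idea concrete, here is a minimal sketch of the sampling loop: start from noise and integrate a velocity field from t=0 to t=1 with fixed-step Euler. It is not the project's code. The `toy_velocity` function is a hypothetical stand-in for the trained DiT, which in the real system would predict the velocity from the noisy mel spectrogram, the text, and the reference audio. The three-value `target` is likewise an invented placeholder for a mel frame.

```python
import random


def toy_velocity(x, t, target):
    """Hypothetical stand-in for the trained network.

    Returns the velocity that moves x toward `target` along a straight
    path, so that the path ends at `target` when t reaches 1.
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def euler_sample(velocity_fn, x0, steps=8):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t never reaches 1.0, so toy_velocity never divides by zero
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Invented placeholder for a single mel frame.
target = [0.5, -1.0, 2.0]

# Sampling starts from Gaussian noise.
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in target]

sample = euler_sample(lambda x, t: toy_velocity(x, t, target), noise)
print(sample)
```

With this toy field the loop lands on `target` up to floating-point rounding. A trained model only approximates the velocity, so the number of steps becomes a trade-off between speed and quality.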

Quick Start & Requirements

  • Install via pip: pip install f5-tts-mlx
  • Requires macOS with Apple Silicon (MLX framework dependency).
  • Basic usage: python -m f5_tts_mlx.generate --text "..."
  • Voice matching requires a reference audio file: a mono, 24 kHz WAV of 5-10 seconds.
  • Quantized models (4-bit, 8-bit) are available via the --q flag.
  • Pretrained model weights are available on Hugging Face.
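The reference-audio requirements above (mono, 24 kHz, 5-10 seconds) can be checked before running generation. The sketch below uses only Python's standard `wave` module. The helper names `check_reference_wav` and `write_silent_wav` are hypothetical and are not part of f5-tts-mlx. The demo at the bottom writes silent files only to exercise the check.

```python
import os
import tempfile
import wave


def check_reference_wav(path, rate=24000, min_s=5.0, max_s=10.0):
    """Return a list of problems; an empty list means the file qualifies."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        actual_rate = wav.getframerate()
        seconds = wav.getnframes() / actual_rate
    problems = []
    if channels != 1:
        problems.append(f"expected mono, found {channels} channels")
    if actual_rate != rate:
        problems.append(f"expected {rate} Hz, found {actual_rate} Hz")
    if not min_s <= seconds <= max_s:
        problems.append(f"expected {min_s}-{max_s} s, found {seconds:.2f} s")
    return problems


def write_silent_wav(path, seconds, rate, channels):
    """Write 16-bit silence; used here only to produce demo files."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * channels * int(seconds * rate))


with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.wav")
    bad = os.path.join(tmp, "bad.wav")
    write_silent_wav(good, seconds=6, rate=24000, channels=1)
    write_silent_wav(bad, seconds=1, rate=44100, channels=2)
    good_problems = check_reference_wav(good)
    bad_problems = check_reference_wav(bad)

print(good_problems)
print(bad_problems)
```

The first file passes with no problems reported. The second fails all three checks (channel count, sample rate, duration).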

Highlighted Details

  • Generates speech in approximately 4 seconds on an M3 Max MacBook Pro.
  • Supports zero-shot voice cloning with reference audio.
  • Offers quantized models for reduced memory and bandwidth usage.
  • Can be piped with other MLX models, e.g., language models.
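As a rough illustration of why 4-bit and 8-bit models use less memory, the sketch below applies affine quantization to one group of weights: each weight is stored as a small integer code plus a shared scale and offset. It is a generic, pure-Python example. The scheme and group size used by the published quantized weights are not stated here, so the function names and the group of 64 random values are assumptions made for illustration.

```python
import random


def quantize_group(weights, bits):
    """Map floats to integer codes in [0, 2**bits - 1] with a shared scale and offset."""
    levels = (1 << bits) - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels
    if scale == 0.0:
        scale = 1.0  # all weights equal; any scale reproduces them
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Recover approximate floats from the integer codes."""
    return [lo + c * scale for c in codes]


def max_abs_error(weights, bits):
    """Worst-case reconstruction error for the group, plus the scale used."""
    codes, scale, lo = quantize_group(weights, bits)
    restored = dequantize_group(codes, scale, lo)
    return max(abs(w - r) for w, r in zip(weights, restored)), scale


# Invented group of 64 weights, for illustration only.
rng = random.Random(0)
group = [rng.gauss(0.0, 1.0) for _ in range(64)]

err4, scale4 = max_abs_error(group, bits=4)
err8, scale8 = max_abs_error(group, bits=8)
print(err4, err8)
```

The reconstruction error is bounded by half the step size (`scale / 2`), and the 8-bit step is finer than the 4-bit one. Fewer bits save memory and bandwidth at the cost of a coarser approximation.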

Maintenance & Community

This project is based on original implementations by Yushen Chen (F5 TTS) and Phil Wang (E2 TTS). Further community or maintenance details are not specified in the README.

Licensing & Compatibility

  • Released under the MIT license.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The MLX framework is specific to Apple Silicon hardware, limiting its use to macOS users. The project is an implementation of existing research, and its long-term maintenance status is not detailed.

Health Check

  • Last commit: 6 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 9 stars in the last 30 days

Explore Similar Projects

metavoice-src by metavoiceio

TTS model for human-like, expressive speech

  • Top 0.1% on SourcePulse
  • 4k stars
  • Created 1 year ago
  • Updated 1 year ago
  • Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Pietro Schirano (founder of MagicPath), and 2 more.

GPT-SoVITS by RVC-Boss

Few-shot voice cloning and TTS web UI

  • Top 0.3% on SourcePulse
  • 51k stars
  • Created 1 year ago
  • Updated 1 week ago
  • Starred by Georgios Konstantopoulos (CTO, General Partner at Paradigm) and Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems").