Irodori-TTS by Aratako

Flow Matching TTS for expressive speech synthesis

Created 1 month ago
375 stars

Top 75.6% on SourcePulse

Project Summary

Irodori-TTS is a Flow Matching-based text-to-speech (TTS) model for high-fidelity speech synthesis with advanced style control. It targets researchers and developers who want to integrate sophisticated TTS capabilities, offering zero-shot voice cloning and emoji-driven style customization for both creative and practical speech-generation applications.

How It Works

The core of Irodori-TTS leverages a Rectified Flow Diffusion Transformer (RF-DiT) operating on continuous latents generated by a DACVAE codec. This approach, inspired by Echo-TTS, allows for high-quality waveform reconstruction from a latent space. The model supports conditioning via text, reference audio for speaker identity, and a novel caption encoder for fine-grained style control, particularly in its "VoiceDesign" variant. This combination provides a flexible and powerful framework for controllable speech synthesis.
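As a rough illustration (not the project's actual code), rectified-flow sampling amounts to integrating a learned velocity field from noise toward a clean latent; here a toy constant velocity field stands in for the RF-DiT:

```python
import numpy as np

def euler_sample(v, x0, steps=100):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 (rectified-flow sampling)."""
    x, dt = x0.copy(), 1.0 / steps
    for i in range(steps):
        x = x + dt * v(x, i * dt)
    return x

# Toy setup: for the straight-line path x_t = (1-t)*noise + t*target,
# the ideal rectified-flow velocity is the constant (target - noise).
target = np.array([1.0, -2.0, 0.5])
noise = np.array([0.3, 0.1, -0.7])
latent = euler_sample(lambda x, t: target - noise, noise, steps=10)
# latent lands on target because the toy flow is exactly straight;
# a trained RF-DiT predicts v(x, t) conditioned on text/speaker/caption.
```

In the real model the integrated latent would then be decoded to a waveform by the DACVAE codec.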

Quick Start & Requirements

Installation involves cloning the repository and running uv sync. The project requires PyTorch, which is automatically installed with CUDA 12.8 for Linux/Windows or as a default build for macOS/CPU. Inference can be performed via CLI or a Gradio Web UI. Pre-trained models are available on Hugging Face (e.g., Aratako/Irodori-TTS-500M-v2, Aratako/Irodori-TTS-500M-v2-VoiceDesign), with hosted demos also provided at Aratako/Irodori-TTS-500M-v2-Demo and Aratako/Irodori-TTS-500M-v2-VoiceDesign-Demo respectively.
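The install flow above can be sketched as shell commands. The repository URL is an assumption inferred from the author and project name, and the inference entry points are hypothetical placeholders; only the model IDs come from the summary:

```shell
# Clone and set up the environment (repo URL assumed, not confirmed).
git clone https://github.com/Aratako/Irodori-TTS.git
cd Irodori-TTS

# uv resolves and installs dependencies, including the PyTorch build
# for the platform (CUDA 12.8 on Linux/Windows, default on macOS/CPU).
uv sync

# Inference entry points below are hypothetical -- consult the README
# for the actual CLI and Gradio launch commands.
# uv run python infer.py --model Aratako/Irodori-TTS-500M-v2 ...
# uv run python app.py   # Gradio Web UI
```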

Highlighted Details

  • Flow Matching TTS utilizing RF-DiT over continuous DACVAE latents.
  • Zero-shot voice cloning from short reference audio samples.
  • "VoiceDesign" capability for emoji/caption-conditioned style control.
  • Support for multi-GPU distributed training with mixed precision.
  • Parameter-Efficient Fine-Tuning (PEFT) via LoRA for adapting released checkpoints.
  • Flexible inference options including CLI, Gradio UI, and HuggingFace Hub integration.
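The LoRA option listed above follows the standard low-rank update, adding a trainable delta B·A to a frozen weight W. A minimal NumPy sketch with toy dimensions (not the project's API):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                           # hidden size and LoRA rank (toy values)
W = rng.normal(size=(d, d))           # frozen base weight
A = rng.normal(size=(r, d)) * 0.01    # trainable down-projection
B = np.zeros((d, r))                  # trainable up-projection, zero-initialized

x = rng.normal(size=d)
y = W @ x + B @ (A @ x)               # LoRA forward: base path + low-rank path
# With B zero-initialized, the adapter contributes nothing at the start of
# fine-tuning, so y equals the frozen model's output W @ x.
```

Only A and B (2·d·r parameters) are trained, which is why LoRA can cheaply adapt the released checkpoints.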

Maintenance & Community

The project is maintained by Chihiro Arata, with the primary development hosted on GitHub. Specific community channels (e.g., Discord, Slack) or a public roadmap are not detailed in the provided README.

Licensing & Compatibility

The project's code is released under the MIT License, which is permissive for commercial use. However, the licensing for the pre-trained model weights requires users to consult separate model cards on Hugging Face, as these may have different terms.

Limitations & Caveats

The v1 and v2 codebases and their corresponding checkpoints are not compatible, necessitating careful selection of the correct version. The specific licenses for model weights are not detailed within the README itself and must be verified independently for each model.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 3
  • Issues (30d): 1
  • Star History: 342 stars in the last 30 days

Explore Similar Projects

Starred by Thomas Wolf (Cofounder of Hugging Face), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 5 more.

ultravox by fixie-ai

0.2% · 4k stars
Multimodal LLM for real-time voice interactions
Created 1 year ago · Updated 4 months ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Pietro Schirano (Founder of MagicPath), and 2 more.

metavoice-src by metavoiceio

0.0% · 4k stars
TTS model for human-like, expressive speech
Created 2 years ago · Updated 1 year ago