Irodori-TTS by Aratako

Flow Matching TTS for expressive speech synthesis

Created 1 month ago
375 stars

Top 75.6% on SourcePulse

Project Summary

Irodori-TTS is a Flow Matching-based text-to-speech (TTS) model for high-fidelity speech synthesis with advanced style control. It targets researchers and developers who want to integrate sophisticated TTS capabilities, offering zero-shot voice cloning and emoji-driven style customization for both creative and practical speech-generation applications.

How It Works

The core of Irodori-TTS leverages a Rectified Flow Diffusion Transformer (RF-DiT) operating on continuous latents generated by a DACVAE codec. This approach, inspired by Echo-TTS, allows for high-quality waveform reconstruction from a latent space. The model supports conditioning via text, reference audio for speaker identity, and a novel caption encoder for fine-grained style control, particularly in its "VoiceDesign" variant. This combination provides a flexible and powerful framework for controllable speech synthesis.
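As a rough illustration (not the project's actual code), rectified-flow sampling amounts to integrating a learned velocity field from noise toward a clean latent; here a toy constant velocity field stands in for the RF-DiT:

```python
import numpy as np

def euler_sample(v, x0, steps=100):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 (rectified-flow sampling)."""
    x, dt = x0.copy(), 1.0 / steps
    for i in range(steps):
        x = x + dt * v(x, i * dt)
    return x

# Toy setup: for the straight-line path x_t = (1-t)*noise + t*target,
# the ideal rectified-flow velocity is the constant (target - noise).
target = np.array([1.0, -2.0, 0.5])
noise = np.array([0.3, 0.1, -0.7])
latent = euler_sample(lambda x, t: target - noise, noise, steps=10)
# latent lands on target because the toy flow is exactly straight;
# a trained RF-DiT predicts v(x, t) conditioned on text/speaker/caption.
```

In the real model the integrated latent would then be decoded to a waveform by the DACVAE codec.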

Quick Start & Requirements

Installation involves cloning the repository and running uv sync. The project requires PyTorch, which is automatically installed with CUDA 12.8 for Linux/Windows or as a default build for macOS/CPU. Inference can be performed via CLI or a Gradio Web UI. Pre-trained models are available on Hugging Face (e.g., Aratako/Irodori-TTS-500M-v2, Aratako/Irodori-TTS-500M-v2-VoiceDesign), with hosted demos also provided at Aratako/Irodori-TTS-500M-v2-Demo and Aratako/Irodori-TTS-500M-v2-VoiceDesign-Demo respectively.
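The install flow above can be sketched as shell commands. The repository URL is an assumption inferred from the author and project name, and the inference entry points are hypothetical placeholders; only the model IDs come from the summary:

```shell
# Clone and set up the environment (repo URL assumed, not confirmed).
git clone https://github.com/Aratako/Irodori-TTS.git
cd Irodori-TTS

# uv resolves and installs dependencies, including the PyTorch build
# for the platform (CUDA 12.8 on Linux/Windows, default on macOS/CPU).
uv sync

# Inference entry points below are hypothetical -- consult the README
# for the actual CLI and Gradio launch commands.
# uv run python infer.py --model Aratako/Irodori-TTS-500M-v2 ...
# uv run python app.py   # Gradio Web UI
```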

Highlighted Details

  • Flow Matching TTS utilizing RF-DiT over continuous DACVAE latents.
  • Zero-shot voice cloning from short reference audio samples.
  • "VoiceDesign" capability for emoji/caption-conditioned style control.
  • Support for multi-GPU distributed training with mixed precision.
  • Parameter-Efficient Fine-Tuning (PEFT) via LoRA for adapting released checkpoints.
  • Flexible inference options including CLI, Gradio UI, and HuggingFace Hub integration.
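The LoRA option listed above follows the standard low-rank update, adding a trainable delta B·A to a frozen weight W. A minimal NumPy sketch with toy dimensions (not the project's API):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                           # hidden size and LoRA rank (toy values)
W = rng.normal(size=(d, d))           # frozen base weight
A = rng.normal(size=(r, d)) * 0.01    # trainable down-projection
B = np.zeros((d, r))                  # trainable up-projection, zero-initialized

x = rng.normal(size=d)
y = W @ x + B @ (A @ x)               # LoRA forward: base path + low-rank path
# With B zero-initialized, the adapter contributes nothing at the start of
# fine-tuning, so y equals the frozen model's output W @ x.
```

Only A and B (2·d·r parameters) are trained, which is why LoRA can cheaply adapt the released checkpoints.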

Maintenance & Community

The project is maintained by Chihiro Arata, with the primary development hosted on GitHub. Specific community channels (e.g., Discord, Slack) or a public roadmap are not detailed in the provided README.

Licensing & Compatibility

The project's code is released under the MIT License, which is permissive for commercial use. However, the licensing for the pre-trained model weights requires users to consult separate model cards on Hugging Face, as these may have different terms.

Limitations & Caveats

The v1 and v2 codebases and their corresponding checkpoints are not compatible, necessitating careful selection of the correct version. The specific licenses for model weights are not detailed within the README itself and must be verified independently for each model.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 3
  • Issues (30d): 1
  • Star History: 342 stars in the last 30 days

Explore Similar Projects

Starred by Thomas Wolf (Cofounder of Hugging Face), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 5 more.

ultravox by fixie-ai

0.2% · 4k stars
Multimodal LLM for real-time voice interactions
Created 1 year ago · Updated 4 months ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Pietro Schirano (Founder of MagicPath), and 2 more.

metavoice-src by metavoiceio

0.0% · 4k stars
TTS model for human-like, expressive speech
Created 2 years ago · Updated 1 year ago