T5Gemma-TTS by Aratako

LLM-powered multilingual TTS with voice cloning

Created 4 weeks ago


251 stars

Top 99.8% on SourcePulse

View on GitHub
Project Summary

Aratako/T5Gemma-TTS is a multilingual text-to-speech (TTS) system built on the T5Gemma encoder-decoder LLM architecture. It addresses the need for flexible, high-quality speech synthesis with advanced features such as zero-shot voice cloning and explicit duration control. Aimed at researchers and power users, it provides a powerful tool for generating diverse, controllable audio output.

How It Works

The system adapts the T5Gemma encoder-decoder LLM architecture to text-to-speech, supporting English, Chinese, and Japanese. Key capabilities include zero-shot voice cloning from a short reference audio clip and explicit control over the duration of the generated speech. The approach pairs the LLM's linguistic modeling with neural audio codec-based waveform generation.
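Explicit duration control in codec-based TTS typically amounts to giving the autoregressive decoder a fixed audio-token budget derived from the target length. The sketch below illustrates the idea only; the 50-frames-per-second codec rate and the helper name are assumptions, not taken from the repository.

```python
# Illustrative sketch: convert a requested speech duration into an
# audio-token budget for an autoregressive decoder.
# ASSUMPTION: a neural codec emitting 50 frames (tokens) per second;
# the real frame rate depends on the codec T5Gemma-TTS actually uses.

CODEC_FRAMES_PER_SECOND = 50  # assumed frame rate, not from the repo

def duration_to_token_budget(target_seconds: float,
                             frames_per_second: int = CODEC_FRAMES_PER_SECOND) -> int:
    """Return the number of audio tokens to generate for a target duration."""
    if target_seconds <= 0:
        raise ValueError("target duration must be positive")
    return round(target_seconds * frames_per_second)

# A 3.5 s utterance at an assumed 50 frames/s needs a budget of 175 tokens.
```

Since the decoder stops at (or near) the budget rather than exactly matching speech content, this also explains why the duration control noted below is approximate rather than exact.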

Quick Start & Requirements

Install by cloning the repo and running pip install -r requirements.txt. GPU support requires PyTorch with CUDA (e.g., pip install "torch<=2.8.0" torchaudio --index-url https://download.pytorch.org/whl/cu128); Apple Silicon (MPS) is also supported. Quantized models (8-bit/4-bit encoder) reduce VRAM requirements (the 4-bit model needs roughly 7.6 GB). Inference is available via command-line scripts or a Gradio web UI.
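The steps above can be collected into one session. The repository URL follows from the project name on this page; the inference entry-point names at the end are placeholders, since the actual script names are not stated here.

```shell
# Clone and install base dependencies (per the Quick Start above)
git clone https://github.com/Aratako/T5Gemma-TTS.git
cd T5Gemma-TTS
pip install -r requirements.txt

# CUDA 12.8 build of PyTorch for GPU inference (command quoted above)
pip install "torch<=2.8.0" torchaudio --index-url https://download.pytorch.org/whl/cu128

# Run inference -- script names below are hypothetical placeholders;
# check the repository README for the actual entry points.
# python infer.py --text "Hello" --ref reference.wav   # CLI (placeholder name)
# python app.py                                        # Gradio web UI (placeholder name)
```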

Highlighted Details

  • Multilingual Support: English, Chinese, Japanese.
  • Voice Cloning: Zero-shot from reference audio.
  • Duration Control: Explicit target duration specification.
  • Batch Generation: Parallel audio variation generation.
  • Quantized Models: 8-bit/4-bit encoder quantization for reduced VRAM.
  • Low-VRAM Options: CPU offloading for codecs/transcription.
  • Flexible Training: Full, fine-tuning, and LoRA fine-tuning scripts.
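The VRAM savings from encoder quantization can be approximated from weight storage alone. This is a back-of-envelope sketch: the 4-billion-parameter encoder size is a hypothetical figure for illustration, and real memory use also includes the decoder, activations, KV cache, and the audio codec (which is why the 4-bit model above still needs about 7.6 GB in total).

```python
# Back-of-envelope estimate of encoder weight memory under quantization.
# ASSUMPTION: a hypothetical 4-billion-parameter encoder; actual totals
# also cover the decoder, activations, KV cache, and audio codec.

def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GiB for a given quantization width."""
    bytes_total = num_params * bits_per_weight / 8
    return bytes_total / 2**30

params = 4e9  # hypothetical encoder parameter count
for bits in (16, 8, 4):  # full precision vs. the 8-bit/4-bit options above
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(params, bits):.1f} GiB")
```

Halving the bit width halves the weight footprint, which is the mechanism behind the repo's 8-bit and 4-bit encoder variants.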

Maintenance & Community

No specific details regarding maintainers, community channels, or active development signals were found in the provided README.

Licensing & Compatibility

Code is MIT licensed, generally permitting commercial use. Model weight licensing is detailed separately in the model card.

Limitations & Caveats

Inference is not real-time due to autoregressive generation. Duration control is approximate, and speech pacing/naturalness may vary. Audio quality depends on training data, potentially underperforming for underrepresented voices. Native Windows inference can be unstable; WSL2 or Docker is recommended.

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 4
  • Star History: 251 stars in the last 29 days
