TTS toolkit for 7000+ languages
IMS-Toucan is a Text-to-Speech (TTS) toolkit for teaching, training, and running inference with state-of-the-art speech synthesis models, with a focus on supporting more than 7000 languages. It aims to provide a fast, controllable, and computationally efficient solution for researchers and developers in the TTS domain, particularly those working with low-resource languages.
How It Works
The system leverages a massively multilingual architecture, building upon established TTS models like FastSpeech 2 and HiFi-GAN, with components from MatchaTTS and StableTTS. It utilizes an intermediate representation via encodec for efficient data caching. For grapheme-to-phoneme conversion, it integrates eSpeak-NG and transphone, offering flexibility in handling diverse language scripts. The architecture supports controllable synthesis through parameters for duration, pitch, and energy, and enables zero-shot multispeaker capabilities.
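To make the idea of controllable synthesis more concrete, the following is a minimal sketch of how duration, pitch, and energy controls are commonly applied in a FastSpeech 2-style model: predicted per-phoneme durations are scaled and used to expand the phoneme sequence into frames, while pitch and energy curves are multiplied by a factor. This is an illustration only, not IMS-Toucan's actual code or API; the function names `scale_durations`, `length_regulate`, and `scale_curve` are hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def scale_durations(durations: Sequence[int], factor: float) -> List[int]:
    """Scale per-phoneme frame counts by `factor` (hypothetical helper).

    Phonemes that had a non-zero duration keep at least one frame,
    so speeding up never deletes a phoneme entirely.
    """
    return [max(1, round(d * factor)) if d > 0 else 0 for d in durations]


def length_regulate(phoneme_states: Sequence[T], durations: Sequence[int]) -> List[T]:
    """Repeat each phoneme-level state once per frame it should last."""
    frames: List[T] = []
    for state, n_frames in zip(phoneme_states, durations):
        frames.extend([state] * n_frames)
    return frames


def scale_curve(values: Sequence[float], factor: float) -> List[float]:
    """Scale a pitch or energy curve by a constant factor."""
    return [v * factor for v in values]


if __name__ == "__main__":
    phonemes = ["h", "e", "l", "o"]
    durations = [2, 3, 1, 4]  # frames per phoneme, as a duration predictor might output

    slower = scale_durations(durations, 2.0)   # twice as slow
    frames = length_regulate(phonemes, slower)
    print(slower, len(frames))

    pitch = [110.0, 120.0, 115.0, 100.0]       # one value per phoneme, in Hz
    print(scale_curve(pitch, 1.5))             # raise pitch by 50%
```

In a real model the states would be encoder output vectors rather than strings, and the scaled pitch and energy values would be embedded and added to those states before decoding.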
Quick Start & Requirements
Install the Python dependencies within a Python 3.10+ virtual environment:
pip install --no-cache-dir -r requirements.txt
Required system packages: libsndfile1, espeak-ng, ffmpeg, libasound-dev, and libportaudio2. GPU support (CUDA) is recommended for training. Phonemization requires eSpeak-NG to be installed and configured via the PHONEMIZER_ESPEAK_LIBRARY environment variable; the README gives specific instructions for Windows and macOS.
Highlighted Details
Maintenance & Community
The project is actively developed by the Institute for Natural Language Processing (IMS), University of Stuttgart. Links to an interactive demo and dataset are provided on Hugging Face.
Licensing & Compatibility
The code and models are free to use. Licensing details beyond this are not explicitly stated in the README. The project's academic origin may suggest a permissive, research-friendly license, but this should be verified before use.
Limitations & Caveats
The README notes that warnings about scheduler/optimizer ordering and xFormers compatibility may appear and describes them as harmless. The loss may turn to NaN when training on less clean data; in that case the README suggests using a scorer or reducing the learning rate. The torchaudio 'sox_io' backend is not supported on Windows.
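As a rough illustration of the NaN-loss caveat, the sketch below shows one generic way to locate problematic samples: compute a loss per sample, then flag any sample whose loss is non-finite or far above the median. It is not the toolkit's own scorer; the function name `flag_problem_samples` and the `ratio` threshold are assumptions made for this example.

```python
import math
from statistics import median
from typing import Dict, List


def flag_problem_samples(losses: Dict[str, float], ratio: float = 3.0) -> List[str]:
    """Return IDs of samples whose loss looks suspicious (hypothetical helper).

    A sample is flagged when its loss is NaN or infinite, or when it
    exceeds `ratio` times the median of all finite losses.
    """
    finite = [v for v in losses.values() if math.isfinite(v)]
    if not finite:
        # Nothing to compare against: every sample is suspect.
        return sorted(losses)
    cutoff = ratio * median(finite)
    return sorted(
        sample_id
        for sample_id, value in losses.items()
        if not math.isfinite(value) or value > cutoff
    )


if __name__ == "__main__":
    per_sample_loss = {
        "utt_001": 0.5,
        "utt_002": 0.6,
        "utt_003": float("nan"),  # e.g. a corrupt or empty recording
        "utt_004": 5.0,           # e.g. a transcript that does not match the audio
        "utt_005": 0.4,
    }
    print(flag_problem_samples(per_sample_loss))
```

Flagged samples can then be inspected or removed before retraining; if the loss still diverges on the cleaned data, lowering the learning rate is the other remedy mentioned above.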