IMS-Toucan by DigitalPhonetics

TTS toolkit for 7000+ languages

Created 4 years ago

2,187 stars

Top 20.2% on SourcePulse

1 Expert Loves This Project

Tim J. Baek

Founder of Open WebUI

Project Summary

IMS-Toucan is a Text-to-Speech (TTS) toolkit for training, running inference with, and teaching state-of-the-art speech synthesis models, with a focus on supporting over 7000 languages. It aims to provide a fast, controllable, and computationally efficient solution for researchers and developers in the TTS domain, particularly those working with low-resource languages.

How It Works

The system uses a massively multilingual architecture that builds on established TTS models such as FastSpeech 2 and HiFi-GAN and incorporates components from MatchaTTS and StableTTS. It uses EnCodec as an intermediate representation for efficient data caching. For grapheme-to-phoneme conversion it integrates eSpeak-NG and transphone, which gives it flexibility across diverse languages and scripts. Synthesis is controllable through duration, pitch, and energy parameters, and the architecture supports zero-shot multispeaker synthesis.
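To make the duration, pitch, and energy controls concrete, here is a minimal sketch of how a FastSpeech 2 style model can expose them: per-phoneme predictions are scaled by user-supplied factors and then expanded to frame level by a length regulator. This is an illustration in plain Python, not IMS-Toucan's implementation; the function names and the `*_scale` parameters are hypothetical.

```python
from typing import Dict, List, Sequence


def scale_durations(durations: Sequence[int], factor: float) -> List[int]:
    """Scale per-phoneme frame counts; nonzero durations keep at least one frame."""
    return [max(1, round(d * factor)) if d > 0 else 0 for d in durations]


def length_regulate(values: Sequence[float], durations: Sequence[int]) -> List[float]:
    """Repeat each phoneme-level value once per frame of that phoneme."""
    if len(values) != len(durations):
        raise ValueError("values and durations must have the same length")
    frames: List[float] = []
    for value, duration in zip(values, durations):
        frames.extend([value] * duration)
    return frames


def apply_controls(
    durations: Sequence[int],
    pitch: Sequence[float],
    energy: Sequence[float],
    duration_scale: float = 1.0,  # hypothetical knob: >1 slows speech down
    pitch_scale: float = 1.0,     # hypothetical knob: >1 raises pitch
    energy_scale: float = 1.0,    # hypothetical knob: >1 increases energy
) -> Dict[str, List[float]]:
    """Apply the three scaling factors and expand to frame-level sequences."""
    scaled = scale_durations(durations, duration_scale)
    return {
        "durations": scaled,
        "pitch": length_regulate([p * pitch_scale for p in pitch], scaled),
        "energy": length_regulate([e * energy_scale for e in energy], scaled),
    }


if __name__ == "__main__":
    out = apply_controls([2, 3, 1], [1.0, 2.0, 3.0], [0.2, 0.4, 0.6], duration_scale=2.0)
    print(out["durations"], len(out["pitch"]))
```

Because the scaling happens on explicit per-phoneme values before expansion, each control can be changed independently of the others, which is the property that makes this family of models controllable.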

Quick Start & Requirements

Installation: Clone the repository, create a virtual environment with Python 3.10 or newer, and install the dependencies with pip install --no-cache-dir -r requirements.txt.
Prerequisites: On Linux, the system packages libsndfile1, espeak-ng, ffmpeg, libasound-dev, and libportaudio2 are required. A CUDA-capable GPU is recommended for training. For phonemization to work, eSpeak-NG must be installed and configured through the PHONEMIZER_ESPEAK_LIBRARY environment variable; the README gives specific instructions for Windows and macOS.
Resources: Pretrained models are downloaded on demand. An interactive demo is available on Hugging Face.
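
The requirements above can be checked before a first run. The following is a small sketch using only the Python standard library; `check_environment` is a hypothetical helper, not part of IMS-Toucan, and it assumes that the `espeak-ng` and `ffmpeg` executables should be on the PATH and that PHONEMIZER_ESPEAK_LIBRARY should point to an existing file. On some systems the library may be found without the variable, so treat a warning about it as a hint rather than a hard failure.

```python
import os
import shutil
import sys
from typing import Callable, List, Mapping, Optional, Tuple


def check_environment(
    env: Mapping[str, str] = os.environ,
    which: Callable[[str], Optional[str]] = shutil.which,
    python_version: Tuple[int, int] = sys.version_info[:2],
) -> List[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems: List[str] = []

    # The summary states Python 3.10+ as the minimum version.
    if tuple(python_version) < (3, 10):
        problems.append("Python 3.10 or newer is required")

    # Assumption: both tools are expected to be callable from the PATH.
    for tool in ("espeak-ng", "ffmpeg"):
        if which(tool) is None:
            problems.append(f"'{tool}' was not found on PATH")

    # The environment variable named in the prerequisites.
    library = env.get("PHONEMIZER_ESPEAK_LIBRARY")
    if not library:
        problems.append("PHONEMIZER_ESPEAK_LIBRARY is not set")
    elif not os.path.exists(library):
        problems.append(f"PHONEMIZER_ESPEAK_LIBRARY points to a missing path: {library}")

    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("warning:", problem)
```

The `env`, `which`, and `python_version` parameters default to the real system state and exist only so the function can be exercised with fake values.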

Highlighted Details

Supports over 7000 languages, with a focus on low-resource scenarios.
Offers controllable synthesis parameters (duration, pitch, energy).
Enables zero-shot multispeaker synthesis and prosody cloning.
Includes a published massively multilingual TTS dataset.

Maintenance & Community

The project is actively developed by the Institute for Natural Language Processing (IMS), University of Stuttgart. An interactive demo and the dataset are available on Hugging Face.

Licensing & Compatibility

The README states that the code and models are free to use but does not spell out further licensing terms. Check the LICENSE file in the repository before redistributing the code or models or using them commercially.

Limitations & Caveats

The README notes that warnings about scheduler/optimizer ordering and xFormers compatibility may appear and describes them as harmless. The loss may turn to NaN when training on less clean data; for that case the README suggests using a scorer or reducing the learning rate. The torchaudio 'sox_io' backend is not supported on Windows.
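
As an illustration of the reduced-learning-rate advice, the sketch below shows one generic way to react to a non-finite loss: skip the update and lower the learning rate. It is not taken from IMS-Toucan's training loop; the class `NonFiniteLossGuard`, the decay factor of 0.5, and the minimum learning rate are assumptions chosen for the example.

```python
import math


class NonFiniteLossGuard:
    """Track NaN/inf losses and shrink the learning rate whenever one occurs."""

    def __init__(self, learning_rate: float, decay: float = 0.5,
                 min_learning_rate: float = 1e-6) -> None:
        self.learning_rate = learning_rate
        self.decay = decay                      # assumed factor, tune per setup
        self.min_learning_rate = min_learning_rate
        self.skipped_steps = 0

    def check(self, loss: float) -> bool:
        """Return True if the update should be applied, False if it should be skipped."""
        if math.isfinite(loss):
            return True
        # Non-finite loss: count the skipped step and reduce the learning rate,
        # but never below the configured minimum.
        self.skipped_steps += 1
        self.learning_rate = max(self.min_learning_rate,
                                 self.learning_rate * self.decay)
        return False


if __name__ == "__main__":
    guard = NonFiniteLossGuard(learning_rate=1e-3)
    for loss in (0.9, float("nan"), 0.8):
        applied = guard.check(loss)
        print(loss, applied, guard.learning_rate)
```

In a real training loop, a False result would mean skipping the optimizer step for that batch and copying `guard.learning_rate` into the optimizer. If skipped steps accumulate quickly, the data itself is the more likely culprit.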

Health Check

Last Commit

1 month ago

Responsiveness

1 week

Star History

28 stars in the last 30 days