IMS-Toucan  by DigitalPhonetics

TTS toolkit for 7000+ languages

created 4 years ago
1,625 stars

Top 26.4% on sourcepulse

Project Summary

IMS-Toucan is a Text-to-Speech (TTS) toolkit for teaching, training, and running state-of-the-art speech synthesis models, with a focus on supporting over 7000 languages. It aims to give researchers and developers in the TTS domain, particularly those working with low-resource languages, a fast, controllable, and computationally efficient solution.

How It Works

The system uses a massively multilingual architecture that builds on established TTS models such as FastSpeech 2 and HiFi-GAN, with components from MatchaTTS and StableTTS. Audio is stored in an intermediate EnCodec representation so that data can be cached efficiently. For grapheme-to-phoneme conversion it integrates eSpeak-NG and transphone, which gives it flexibility across diverse languages and scripts. Synthesis is controllable through parameters for duration, pitch, and energy, and the architecture supports zero-shot multispeaker synthesis.
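The duration, pitch, and energy controls mentioned above can be pictured with a small FastSpeech 2-style sketch. This is not IMS-Toucan's code: the function name `scale_and_regulate` and its parameters are hypothetical, and the example only assumes that a model predicts one duration (in frames), one pitch value, and one energy value per phoneme, which are then scaled and expanded to frame level.

```python
from typing import List, Sequence, Tuple


def scale_and_regulate(
    durations: Sequence[int],
    pitch: Sequence[float],
    energy: Sequence[float],
    duration_scale: float = 1.0,
    pitch_scale: float = 1.0,
    energy_scale: float = 1.0,
) -> Tuple[List[int], List[float], List[float]]:
    """Scale per-phoneme predictions and expand pitch/energy to frame level.

    Hypothetical helper for illustration only; not part of IMS-Toucan.
    """
    if not (len(durations) == len(pitch) == len(energy)):
        raise ValueError("expected one duration, pitch, and energy value per phoneme")

    # Scale each duration and round to a whole number of frames (never negative).
    scaled_durations = [max(0, round(d * duration_scale)) for d in durations]

    # Length regulation: repeat each phoneme's value once per output frame.
    frame_pitch: List[float] = []
    frame_energy: List[float] = []
    for n_frames, p, e in zip(scaled_durations, pitch, energy):
        frame_pitch.extend([p * pitch_scale] * n_frames)
        frame_energy.extend([e * energy_scale] * n_frames)

    return scaled_durations, frame_pitch, frame_energy
```

Under these assumptions, a `duration_scale` above 1.0 slows speech down by producing more frames per phoneme, while `pitch_scale` and `energy_scale` raise or lower the frame-level contours without changing timing.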

Quick Start & Requirements

  • Installation: Clone the repository and install dependencies via pip install --no-cache-dir -r requirements.txt within a Python 3.10+ virtual environment.
  • Prerequisites: Linux users require libsndfile1, espeak-ng, ffmpeg, libasound-dev, and libportaudio2. A CUDA-capable GPU is recommended for training. Phonemization requires eSpeak-NG to be installed and its library location configured through the PHONEMIZER_ESPEAK_LIBRARY environment variable; the README gives specific instructions for Windows and macOS.
  • Resources: Pretrained models are downloaded on demand. An interactive demo is available on Hugging Face.
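A quick way to sanity-check the prerequisites listed above before installing is sketched below. It uses only the Python standard library; the helper names `find_missing_tools` and `configure_espeak` are hypothetical, and any library path passed in is a placeholder that has to match the local eSpeak-NG installation. Only the command-line tools (espeak-ng, ffmpeg) are checked, because the other packages are shared libraries rather than executables on the PATH.

```python
import os
import shutil
from typing import List, Optional, Sequence

# Command-line tools named in the prerequisites; the remaining packages are
# shared libraries and cannot be detected with a PATH lookup.
REQUIRED_TOOLS = ("espeak-ng", "ffmpeg")


def find_missing_tools(tools: Sequence[str] = REQUIRED_TOOLS) -> List[str]:
    """Return the tools that are not found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def configure_espeak(library_path: Optional[str] = None) -> Optional[str]:
    """Optionally set PHONEMIZER_ESPEAK_LIBRARY, then return its current value.

    library_path is a placeholder argument: it must point at the eSpeak-NG
    shared library of the local installation.
    """
    if library_path is not None:
        os.environ["PHONEMIZER_ESPEAK_LIBRARY"] = library_path
    return os.environ.get("PHONEMIZER_ESPEAK_LIBRARY")


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    print("PHONEMIZER_ESPEAK_LIBRARY =", configure_espeak())
```

Running the script reports which tools are absent and whether the environment variable is set, without changing anything unless a path is passed to `configure_espeak`.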

Highlighted Details

  • Supports over 7000 languages, with a focus on low-resource scenarios.
  • Offers controllable synthesis parameters (duration, pitch, energy).
  • Enables zero-shot multispeaker and prosody cloning.
  • Includes a published massively multilingual TTS dataset.

Maintenance & Community

The project is actively developed by the Institute for Natural Language Processing (IMS), University of Stuttgart. Links to an interactive demo and dataset are provided on Hugging Face.

Licensing & Compatibility

The README describes the code and models as free to use but does not spell out further licensing terms. The project's academic origin is no guarantee of a permissive license, so check the LICENSE file in the repository before reusing the code or models.

Limitations & Caveats

The README mentions warnings about scheduler/optimizer call ordering and xFormers compatibility and describes them as harmless. The loss may turn to NaN when training on less clean data; in that case the README suggests using the scorer to find problematic samples or reducing the learning rate. The torchaudio 'sox_io' backend is not supported on Windows.
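The advice to reduce the learning rate when the loss turns to NaN can be illustrated with a framework-independent guard. This is only a sketch of one possible policy, not something the toolkit is known to implement: the function `guard_step`, the halving factor, and the lower bound are all assumptions chosen for illustration.

```python
import math
from typing import Tuple


def guard_step(
    loss_value: float,
    learning_rate: float,
    decay: float = 0.5,
    min_lr: float = 1e-7,
) -> Tuple[bool, float]:
    """Decide whether to apply an update and which learning rate to use next.

    Hypothetical policy: a finite loss keeps the learning rate unchanged;
    a NaN or infinite loss skips the update and shrinks the learning rate.
    """
    if math.isfinite(loss_value):
        return True, learning_rate
    # Non-finite loss: skip this batch and reduce the learning rate,
    # but never below the configured minimum.
    return False, max(min_lr, learning_rate * decay)
```

In a training loop, the returned flag would decide whether the optimizer step runs for the current batch, and the returned value would be written back as the new learning rate.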

Health Check

  • Last commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

48 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Lianmin Zheng (Author of SGLang).

fish-speech by fishaudio

0.3%
23k
Open-source TTS for multilingual speech synthesis
created 1 year ago
updated 1 week ago
Starred by Georgios Konstantopoulos (CTO, General Partner at Paradigm) and Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems).

GPT-SoVITS by RVC-Boss

0.6%
49k
Few-shot voice cloning and TTS web UI
created 1 year ago
updated 2 weeks ago
Starred by Michael Han (Cofounder of Unsloth), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 7 more.

TTS by coqui-ai

0.4%
42k
Deep learning toolkit for Text-to-Speech, research-tested
created 5 years ago
updated 11 months ago