IMS-Toucan  by DigitalPhonetics

TTS toolkit for 7000+ languages

Created 4 years ago
1,651 stars

Top 25.5% on SourcePulse

Project Summary

IMS-Toucan is a Text-to-Speech (TTS) toolkit for teaching, training, and running state-of-the-art speech synthesis models, with support for over 7000 languages. It aims to be fast, controllable, and computationally efficient, and targets researchers and developers in the TTS domain, particularly those working with low-resource languages.

How It Works

The system uses a massively multilingual architecture built on established TTS models such as FastSpeech 2 and HiFi-GAN, with components from MatchaTTS and StableTTS. It uses EnCodec as an intermediate representation so that data can be cached efficiently. For grapheme-to-phoneme conversion it integrates eSpeak-NG and transphone, which gives it flexibility in handling diverse scripts. Synthesis is controllable through duration, pitch, and energy parameters, and the architecture supports zero-shot multispeaker synthesis.
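To make the duration, pitch, and energy controls concrete, here is a minimal sketch of how such controls typically act in a FastSpeech 2-style model: per-phoneme durations decide how many output frames each phoneme occupies, and a scale factor stretches or flattens a predicted contour. The function names `regulate_length` and `scale_contour` and their parameters are hypothetical and chosen for illustration; this is not IMS-Toucan's API.

```python
from typing import List, Sequence


def regulate_length(
    phoneme_ids: Sequence[int],
    durations: Sequence[int],
    duration_scale: float = 1.0,
) -> List[int]:
    """Repeat each phoneme ID for its scaled number of frames.

    Hypothetical helper: a scale above 1.0 slows speech down,
    a scale below 1.0 speeds it up.
    """
    if len(phoneme_ids) != len(durations):
        raise ValueError("phoneme_ids and durations must have the same length")
    frames: List[int] = []
    for phoneme_id, duration in zip(phoneme_ids, durations):
        n_frames = max(0, round(duration * duration_scale))
        frames.extend([phoneme_id] * n_frames)
    return frames


def scale_contour(values: Sequence[float], scale: float) -> List[float]:
    """Scale a pitch or energy contour around its mean.

    Hypothetical helper: scale=0.0 gives a flat contour,
    scale=1.0 leaves the contour unchanged.
    """
    if not values:
        return []
    mean = sum(values) / len(values)
    return [mean + (value - mean) * scale for value in values]


# Three phonemes lasting 2, 3, and 1 frames at normal speed.
print(regulate_length([10, 20, 30], [2, 3, 1]))  # [10, 10, 20, 20, 20, 30]
# Halving the variation of a two-point pitch contour (values in Hz).
print(scale_contour([100.0, 200.0], 0.5))  # [125.0, 175.0]
```

In a real model the frames would be encoder vectors rather than integer IDs, and the contours would be predicted by the network; the sketch only shows where a user-supplied scale factor enters.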

Quick Start & Requirements

  • Installation: Clone the repository and install dependencies via pip install --no-cache-dir -r requirements.txt within a Python 3.10+ virtual environment.
  • Prerequisites: On Linux, install libsndfile1, espeak-ng, ffmpeg, libasound-dev, and libportaudio2. A CUDA-capable GPU is recommended for training. Phonemization requires eSpeak-NG to be installed and configured through the PHONEMIZER_ESPEAK_LIBRARY environment variable; the README gives specific instructions for Windows and macOS.
  • Resources: Pretrained models are downloaded on demand. An interactive demo is available on Hugging Face.
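Before installing, it can help to check the prerequisites listed above. The following is a small sketch using only the Python standard library; `check_environment` is a hypothetical helper, not part of IMS-Toucan. It assumes the command-line tools are named `espeak-ng` and `ffmpeg` on PATH, and it does not check the shared libraries (libsndfile1, libasound-dev, libportaudio2), which have no executable to look up.

```python
import os
import shutil
import sys

# Executables from the prerequisites list that can be found on PATH.
REQUIRED_TOOLS = ("espeak-ng", "ffmpeg")


def check_environment(
    min_python=(3, 10),
    tools=REQUIRED_TOOLS,
    env=None,
    which=shutil.which,
    version=None,
):
    """Return a list of human-readable problems; empty means all checks passed.

    The env, which, and version parameters exist only so the checks can be
    exercised without touching the real system.
    """
    env = os.environ if env is None else env
    version = sys.version_info[:2] if version is None else tuple(version)
    problems = []

    if version < tuple(min_python):
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ required, "
            f"found {version[0]}.{version[1]}"
        )

    for tool in tools:
        if which(tool) is None:
            problems.append(f"'{tool}' not found on PATH")

    # Only validate the variable if it is set; whether it must be set
    # depends on the platform, so follow the README for your OS.
    library = env.get("PHONEMIZER_ESPEAK_LIBRARY")
    if library and not os.path.exists(library):
        problems.append(
            f"PHONEMIZER_ESPEAK_LIBRARY points to a missing file: {library}"
        )

    return problems


for problem in check_environment():
    print("WARNING:", problem)
```

An empty result only means these particular checks passed; the pip install step from the list above is still needed to pull in the Python dependencies.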

Highlighted Details

  • Supports over 7000 languages, with a focus on low-resource scenarios.
  • Offers controllable synthesis parameters (duration, pitch, energy).
  • Enables zero-shot multispeaker synthesis and prosody cloning.
  • Includes a published massively multilingual TTS dataset.

Maintenance & Community

The project is actively developed by the Institute for Natural Language Processing (IMS), University of Stuttgart. Links to an interactive demo and dataset are provided on Hugging Face.

Licensing & Compatibility

The README states that the code and models are free to use but does not spell out further licensing terms. Check the repository's license file before redistributing the code or models or using them commercially.

Limitations & Caveats

The README mentions warnings about scheduler/optimizer ordering and xFormers compatibility and describes them as harmless. The loss may turn to NaN when training on less clean data; the README suggests using a scorer or reducing the learning rate in that case. torchaudio's 'sox_io' backend is not supported on Windows.
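As an illustration of the "reduce the learning rate" advice for NaN losses, the sketch below shows one generic way a training loop could react to a non-finite loss: skip the update and back off the learning rate. The helper `next_learning_rate`, the backoff factor, and the floor value are assumptions made for this example; this is not how IMS-Toucan handles the problem.

```python
import math


def next_learning_rate(loss, lr, backoff=0.5, min_lr=1e-6):
    """Decide whether to apply an update and which learning rate to use next.

    Hypothetical policy: a finite loss keeps the learning rate unchanged;
    a NaN or infinite loss skips the update and multiplies the learning
    rate by `backoff`, never going below `min_lr`.

    Returns a tuple (apply_update, new_lr).
    """
    if math.isfinite(loss):
        return True, lr
    return False, max(lr * backoff, min_lr)


# Simulated sequence of losses, with one NaN in the middle.
lr = 1e-3
for step, loss in enumerate([2.3, 1.9, float("nan"), 1.7]):
    apply_update, lr = next_learning_rate(loss, lr)
    status = "update" if apply_update else "skip"
    print(f"step {step}: loss={loss} -> {status}, lr={lr}")
```

If NaN losses keep recurring, lowering the configured learning rate from the start, or filtering out problematic samples, is likely a better fix than repeated automatic backoff.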

Health Check

  • Last commit: 4 months ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 9 stars in the last 30 days
