WhisperSpeech is an open-source text-to-speech (TTS) system that inverts the Whisper ASR model to generate speech. It aims to be a powerful and customizable TTS solution, akin to Stable Diffusion for speech, targeting researchers and developers interested in advanced audio generation and voice cloning. The system leverages existing state-of-the-art models for its components, enabling high-quality, efficient speech synthesis.
How It Works
WhisperSpeech employs a multi-stage architecture. Quantized semantic tokens are derived from OpenAI's Whisper encoder; at inference time, a text-to-semantic model generates these tokens from the input text. A semantic-to-acoustic model then predicts Meta's EnCodec tokens, which represent the audio waveform. Finally, Charactr Inc.'s Vocos serves as a high-quality vocoder to synthesize the final audio from the acoustic tokens. This modular approach leverages and combines the strengths of specialized, pre-trained models.
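A minimal sketch of that three-stage flow, with hypothetical stage objects and method names (these are illustrative, not WhisperSpeech's actual internal API):

```python
# Hedged sketch of the pipeline described above; `t2s`, `s2a`, and `vocos`
# are placeholder names for the three stages, not WhisperSpeech internals.
def synthesize(text, t2s, s2a, vocos):
    semantic_tokens = t2s.generate(text)             # text -> quantized Whisper semantic tokens
    acoustic_tokens = s2a.generate(semantic_tokens)  # semantic -> EnCodec acoustic tokens
    waveform = vocos.decode(acoustic_tokens)         # acoustic tokens -> audio (Vocos vocoder)
    return waveform
```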
Quick Start & Requirements
- Install/Run: Start with the provided Google Colab notebooks for the easiest setup and testing; a minimal local-inference sketch follows this list.
- Prerequisites: Python and PyTorch. Training requires substantial hardware (multi-GPU or supercomputer-scale resources), but inference is optimized for consumer GPUs (e.g., RTX 4090).
- Links: GitHub repository at https://github.com/collabora/WhisperSpeech (the Colab notebooks are linked from its README).
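A minimal local-inference sketch, assuming the `whisperspeech` PyPI package and the `Pipeline` API shown in the project's notebooks (default checkpoints and exact behavior may differ):

```python
# pip install whisperspeech
from whisperspeech.pipeline import Pipeline

# Downloads the default text-to-semantic and semantic-to-acoustic
# checkpoints on first use; a CUDA-capable GPU is assumed for good speed.
pipe = Pipeline()
pipe.generate_to_file('output.wav', "Hello from WhisperSpeech!")
```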
Highlighted Details
- Achieves over 12x real-time inference speed on an RTX 4090 using `torch.compile` and KV-caching (see the decoding sketch after this list).
- Supports voice cloning from reference audio samples (usage sketch after this list).
- Demonstrates seamless mixing of multiple languages within a single sentence.
- Trained on the English LibriLight dataset, with ongoing efforts to expand multi-language support.
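The speed-up comes from standard autoregressive-decoding optimizations. A hedged PyTorch sketch of the idea; the `past_kv` model interface is hypothetical, not WhisperSpeech's actual signature:

```python
import torch

@torch.no_grad()
def decode(model, prompt_ids, max_new_tokens):
    # KV caching: `past` stores per-layer attention keys/values, so each
    # step feeds only the newest token instead of the whole sequence.
    ids, past = prompt_ids, None
    for _ in range(max_new_tokens):
        inputs = ids if past is None else ids[:, -1:]
        logits, past = model(inputs, past_kv=past)  # hypothetical interface
        next_id = logits[:, -1].argmax(-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=1)
    return ids

# torch.compile fuses the model into optimized kernels, removing Python
# overhead from the inner loop; combined with KV caching, this is the
# mechanism behind the >12x real-time figure on an RTX 4090.
# model = torch.compile(model)
```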
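Voice cloning and cross-lingual mixing can be illustrated in one call, reusing `pipe` from the quick-start sketch; the `speaker=` keyword and file path are assumptions based on the project's examples, not a confirmed signature:

```python
# `pipe` is the Pipeline from the quick-start sketch above.
# The voice is cloned from a short reference recording; the sentence
# mixes English and Polish, as in the project's demos.
pipe.generate_to_file(
    'cloned.wav',
    "This sentence is in English, a ta cześć jest po polsku.",
    speaker='reference_voice.wav',  # hypothetical path to a reference clip
)
```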
Maintenance & Community
- Developed by Collabora and LAION, with significant compute resources provided by the Jülich Supercomputing Centre.
- Community discussions are available on the LAION Discord server in the `#audio-generation` channel.
- Roadmap includes dataset expansion, emotion/prosody conditioning, and community-driven multi-language data collection.
Licensing & Compatibility
- The code is explicitly described as Open Source and safe for commercial applications. The README does not name a specific license, but the project emphasizes its open nature.
Limitations & Caveats
- While inference is fast, training requires significant computational resources.
- Current models are primarily trained on English, with multi-language capabilities under active development.
- The README mentions "radio static is a feature, not a bug" for a specific voice cloning sample, suggesting potential artifacts or stylistic choices inherited from training data.