WhisperSpeech is an open-source text-to-speech (TTS) system that inverts the Whisper ASR model to generate speech. It aims to be a powerful and customizable TTS solution, akin to Stable Diffusion for speech, targeting researchers and developers interested in advanced audio generation and voice cloning. The system leverages existing state-of-the-art models for its components, enabling high-quality, efficient speech synthesis.
How It Works
WhisperSpeech employs a multi-stage architecture. Quantized semantic tokens are derived from OpenAI's Whisper encoder; at inference time, a text-to-semantic model generates these tokens from the input text. A semantic-to-acoustic model then predicts Meta's EnCodec tokens, which represent the audio waveform. Finally, Charactr Inc.'s Vocos serves as a high-quality vocoder to synthesize the final audio from the acoustic tokens. This modular approach leverages and combines the strengths of specialized, pre-trained models.
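A minimal sketch of that three-stage flow, with hypothetical stage objects and method names (these are illustrative, not WhisperSpeech's actual internal API):

```python
# Hedged sketch of the pipeline described above; `t2s`, `s2a`, and `vocos`
# are placeholder names for the three stages, not WhisperSpeech internals.
def synthesize(text, t2s, s2a, vocos):
    semantic_tokens = t2s.generate(text)             # text -> quantized Whisper semantic tokens
    acoustic_tokens = s2a.generate(semantic_tokens)  # semantic -> EnCodec acoustic tokens
    waveform = vocos.decode(acoustic_tokens)         # acoustic tokens -> audio (Vocos vocoder)
    return waveform
```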
Quick Start & Requirements
- Install/Run: Start with the provided Google Colab notebooks for the easiest setup and testing; a minimal local-inference sketch follows this list.
- Prerequisites: Python and PyTorch. Training requires substantial hardware (multi-GPU or supercomputer-scale resources), but inference is optimized for consumer GPUs (e.g., RTX 4090).
- Links: GitHub repository at https://github.com/collabora/WhisperSpeech (the Colab notebooks are linked from its README).
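A minimal local-inference sketch, assuming the `whisperspeech` PyPI package and the `Pipeline` API shown in the project's notebooks (default checkpoints and exact behavior may differ):

```python
# pip install whisperspeech
from whisperspeech.pipeline import Pipeline

# Downloads the default text-to-semantic and semantic-to-acoustic
# checkpoints on first use; a CUDA-capable GPU is assumed for good speed.
pipe = Pipeline()
pipe.generate_to_file('output.wav', "Hello from WhisperSpeech!")
```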
Highlighted Details
- Achieves over 12x real-time inference speed on an RTX 4090 using `torch.compile` and KV-caching (see the decoding sketch after this list).
- Supports voice cloning from reference audio samples (usage sketch after this list).
- Demonstrates seamless mixing of multiple languages within a single sentence.
- Trained on the English LibriLight dataset, with ongoing efforts to expand multi-language support.
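The speed-up comes from standard autoregressive-decoding optimizations. A hedged PyTorch sketch of the idea; the `past_kv` model interface is hypothetical, not WhisperSpeech's actual signature:

```python
import torch

@torch.no_grad()
def decode(model, prompt_ids, max_new_tokens):
    # KV caching: `past` stores per-layer attention keys/values, so each
    # step feeds only the newest token instead of the whole sequence.
    ids, past = prompt_ids, None
    for _ in range(max_new_tokens):
        inputs = ids if past is None else ids[:, -1:]
        logits, past = model(inputs, past_kv=past)  # hypothetical interface
        next_id = logits[:, -1].argmax(-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=1)
    return ids

# torch.compile fuses the model into optimized kernels, removing Python
# overhead from the inner loop; combined with KV caching, this is the
# mechanism behind the >12x real-time figure on an RTX 4090.
# model = torch.compile(model)
```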
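Voice cloning and cross-lingual mixing can be illustrated in one call, reusing `pipe` from the quick-start sketch; the `speaker=` keyword and file path are assumptions based on the project's examples, not a confirmed signature:

```python
# `pipe` is the Pipeline from the quick-start sketch above.
# The voice is cloned from a short reference recording; the sentence
# mixes English and Polish, as in the project's demos.
pipe.generate_to_file(
    'cloned.wav',
    "This sentence is in English, a ta cześć jest po polsku.",
    speaker='reference_voice.wav',  # hypothetical path to a reference clip
)
```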
Maintenance & Community
- Developed by Collabora and LAION, with significant compute resources provided by the Jülich Supercomputing Centre.
- Community discussions are available on the LAION Discord server in the `#audio-generation` channel.
- Roadmap includes dataset expansion, emotion/prosody conditioning, and community-driven multi-language data collection.
Licensing & Compatibility
- The code is explicitly described as Open Source and safe for commercial applications. The README does not name a specific license, but the project emphasizes its open nature.
Limitations & Caveats
- While inference is fast, training requires significant computational resources.
- Current models are primarily trained on English, with multi-language capabilities under active development.
- The README mentions "radio static is a feature, not a bug" for a specific voice cloning sample, suggesting potential artifacts or stylistic choices inherited from training data.