Open-source text-to-speech system built by inverting Whisper
Top 11.2% on SourcePulse
WhisperSpeech is an open-source text-to-speech (TTS) system that inverts the Whisper ASR model to generate speech. It aims to be a powerful and customizable TTS solution, akin to Stable Diffusion for speech, targeting researchers and developers interested in advanced audio generation and voice cloning. The system leverages existing state-of-the-art models for its components, enabling high-quality, efficient speech synthesis.
How It Works
WhisperSpeech employs a multi-stage architecture. It uses OpenAI's Whisper model to generate semantic tokens from text, which are then quantized. For acoustic modeling, it utilizes Meta's EnCodec to represent the audio waveform. Finally, Charactr Inc.'s Vocos serves as a high-quality vocoder to synthesize the final audio from the acoustic tokens. This modular approach allows for leveraging and combining the strengths of specialized, pre-trained models.
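To make the three-stage flow concrete, here is a toy sketch that traces text through the pipeline. The function names (`text_to_semantic_tokens`, `semantic_to_acoustic_tokens`, `vocode`) are hypothetical placeholders for the stages described above, not WhisperSpeech's actual API; the real project implements them with Whisper, EnCodec, and Vocos.

```python
import numpy as np

def text_to_semantic_tokens(text: str) -> np.ndarray:
    """Stage 1: a Whisper-derived model maps text to quantized semantic tokens."""
    # Placeholder: pretend each character maps to one semantic token ID.
    return np.array([ord(c) % 512 for c in text])

def semantic_to_acoustic_tokens(semantic: np.ndarray) -> np.ndarray:
    """Stage 2: an acoustic model predicts EnCodec-style acoustic tokens."""
    # Placeholder: expand each semantic token into a few acoustic tokens.
    return np.repeat(semantic, 3) % 1024

def vocode(acoustic: np.ndarray, sample_rate: int = 24_000) -> np.ndarray:
    """Stage 3: a Vocos-style vocoder turns acoustic tokens into a waveform."""
    # Placeholder: emit a fixed-frequency tone so the example runs end to end.
    n_samples = len(acoustic) * sample_rate // 75  # assume ~75 tokens/sec
    t = np.linspace(0, n_samples / sample_rate, n_samples)
    return np.sin(2 * np.pi * 220 * t).astype(np.float32)

if __name__ == "__main__":
    semantic = text_to_semantic_tokens("Hello from a toy pipeline.")
    acoustic = semantic_to_acoustic_tokens(semantic)
    audio = vocode(acoustic)
    print(f"{len(semantic)} semantic tokens -> {len(acoustic)} acoustic tokens "
          f"-> {audio.shape[0]} audio samples")
```

The modularity matters here: because the stages only exchange token sequences, any one component (the semantic encoder, the acoustic model, or the vocoder) can be swapped or fine-tuned independently.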
Quick Start & Requirements
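A minimal quick-start sketch, assuming the `whisperspeech` pip package and the `Pipeline` helper shown in the project README; the model reference and method names are taken from that README and may differ between releases, and a CUDA-capable GPU is strongly recommended.

```python
# Install first:
#   pip install whisperspeech

from whisperspeech.pipeline import Pipeline

# Pipeline downloads pretrained text-to-semantic and semantic-to-acoustic
# models on first use. The model reference below is an assumption; check the
# repository for the currently recommended checkpoints.
pipe = Pipeline(s2a_ref='collabora/whisperspeech:s2a-q4-tiny-en+pl.model')

# Synthesize speech for a short prompt and write it to a WAV file.
pipe.generate_to_file("output.wav", "WhisperSpeech inverts Whisper to talk back.")
```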
Highlighted Details
Inference is accelerated with torch.compile and KV-caching.
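As a generic illustration of the torch.compile half of that optimization (using a stand-in transformer layer rather than WhisperSpeech's own models):

```python
import torch
import torch.nn as nn

# Stand-in decoder block; WhisperSpeech applies the same idea to its own
# text-to-semantic and semantic-to-acoustic models.
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)

# torch.compile (PyTorch 2.x) traces the module and fuses kernels, cutting
# per-step Python overhead during autoregressive generation.
compiled_layer = torch.compile(layer)

x = torch.randn(1, 128, 512)  # (batch, sequence, features)
with torch.no_grad():
    out = compiled_layer(x)
print(out.shape)  # torch.Size([1, 128, 512])
```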
Maintenance & Community
Discussion and support take place in the project's #audio-generation channel.
Licensing & Compatibility
Limitations & Caveats