WhisperSpeech by WhisperSpeech

Open-source text-to-speech system built by inverting Whisper

created 2 years ago
4,324 stars

Top 11.5% on sourcepulse

Project Summary

WhisperSpeech is an open-source text-to-speech (TTS) system that inverts the Whisper ASR model to generate speech. It aims to be a powerful and customizable TTS solution, akin to Stable Diffusion for speech, targeting researchers and developers interested in advanced audio generation and voice cloning. The system leverages existing state-of-the-art models for its components, enabling high-quality, efficient speech synthesis.

How It Works

WhisperSpeech employs a multi-stage architecture. Semantic tokens are defined by quantizing embeddings from OpenAI's Whisper ASR model; at inference time, a text-to-semantic model predicts these tokens from the input text. A semantic-to-acoustic model then maps them to acoustic tokens from Meta's EnCodec, which represent the audio waveform. Finally, Charactr Inc.'s Vocos serves as a high-quality vocoder that synthesizes the final audio from the acoustic tokens. This modular approach combines the strengths of specialized, pre-trained models.

Quick Start & Requirements

Highlighted Details

  • Achieves over 12x real-time inference speed on an RTX 4090 using torch.compile and KV-caching.
  • Supports voice cloning from reference audio samples.
  • Demonstrates seamless mixing of multiple languages within a single sentence.
  • Trained on the English LibriLight dataset, with ongoing efforts to expand multi-language support.
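The KV-caching behind the inference-speed bullet can be illustrated with a minimal, framework-free sketch. The `decode` function below is a hypothetical counting model, not WhisperSpeech code: it tallies how many key/value computations an autoregressive decoder performs with and without a cache.

```python
# Toy illustration of why KV-caching speeds up autoregressive decoding:
# with a cache, each step computes K/V only for the newest token;
# without one, K/V for every past token are recomputed each step.

def decode(num_steps: int, use_cache: bool) -> int:
    """Return the number of key/value projections computed in total."""
    kv_cache: list[int] = []
    work = 0
    for step in range(num_steps):
        if use_cache:
            kv_cache.append(step)  # reuse everything already cached
            work += 1
        else:
            work += step + 1       # recompute K/V for all tokens so far
    return work

assert decode(10, use_cache=True) == 10   # linear in sequence length
assert decode(10, use_cache=False) == 55  # quadratic: 1 + 2 + ... + 10
```

In the real system this caching is applied to attention tensors inside PyTorch models and combined with `torch.compile`, which is where the reported 12x real-time figure comes from.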

Maintenance & Community

  • Developed by Collabora and LAION, with significant compute resources provided by the Jülich Supercomputing Centre.
  • Community discussions are available on the LAION Discord server in the #audio-generation channel.
  • Roadmap includes dataset expansion, emotion/prosody conditioning, and community-driven multi-language data collection.

Licensing & Compatibility

  • The code is explicitly stated as Open Source and safe for commercial applications. The specific license is not detailed in the README, but the project emphasizes its open nature.

Limitations & Caveats

  • While inference is fast, training requires significant computational resources.
  • Current models are primarily trained on English, with multi-language capabilities under active development.
  • The README mentions "radio static is a feature, not a bug" for a specific voice cloning sample, suggesting potential artifacts or stylistic choices inherited from training data.
Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 102 stars in the last 90 days

Explore Similar Projects

Starred by Tim J. Baek (Founder of Open WebUI), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 3 more.

StyleTTS2 by yl4579

Text-to-speech model achieving human-level synthesis

  • Top 0.2% · 6k stars
  • Created 2 years ago, updated 11 months ago
  • Starred by Boris Cherny (Creator of Claude Code; MTS at Anthropic), Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), and 19 more.

whisper by openai

Speech recognition model for multilingual transcription/translation

  • Top 0.4% · 86k stars
  • Created 2 years ago, updated 1 month ago