soprano  by ekwek1

Ultra-fast, high-fidelity text-to-speech model

Created 1 month ago
745 stars

Top 46.7% on SourcePulse

Project Summary

Soprano is an ultra-lightweight, open-source text-to-speech (TTS) model designed for real-time, high-fidelity speech synthesis. It targets developers and users who need a compact, fast, and easily deployable TTS solution: the 80M-parameter model runs in under 1 GB of VRAM. The primary benefit is a real-time factor (RTF) of ~2000x, meaning audio is generated roughly 2000 times faster than it plays back, which makes generation near-instantaneous.
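
To put the ~2000x figure in concrete terms, the following minimal sketch converts an audio duration into the compute time implied by that speed-up. The function name and the example durations are illustrative assumptions, and the calculation ignores model loading, fixed per-request overhead, and hardware differences.

```python
def synthesis_seconds(audio_seconds: float, speedup: float = 2000.0) -> float:
    """Compute time implied by generating `audio_seconds` of speech at
    `speedup` times real time (hypothetical helper, not part of Soprano)."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


if __name__ == "__main__":
    # Example durations are arbitrary illustrations.
    for label, seconds in [("10-second sentence", 10), ("1-hour recording", 3600)]:
        print(f"{label}: {synthesis_seconds(seconds) * 1000:.1f} ms of compute")
```

At the stated rate, an hour of speech corresponds to about 1.8 seconds of compute.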

How It Works

Soprano uses a vocoder-based neural decoder built on the Vocos architecture, which generates waveforms significantly faster than diffusion-based decoders while maintaining perceptual quality. Speech is represented with a neural audio codec that compresses audio to ~15 tokens/sec at 0.2 kbps, which keeps generation fast and memory use low. A key innovation is seamless streaming: because the decoder has a finite receptive field, synthesis can start after only a few audio tokens have been generated, giving ultra-low latency (<15 ms) while producing output that is acoustically identical to offline synthesis.
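
The streaming idea can be illustrated with a toy example. The sketch below is not Soprano's decoder or API: it stands in a small fixed-weight sliding window for the real network, with hypothetical names (WEIGHTS, decode_offline, decode_streaming). It shows only the general principle that a decoder with a finite receptive field can emit each output as soon as the tokens inside its window have arrived, and that the result matches full offline decoding.

```python
from typing import Iterable, Iterator, List

# Hypothetical toy kernel: each output depends on a 5-token window.
WEIGHTS = [1, 2, 3, 2, 1]
RADIUS = len(WEIGHTS) // 2  # tokens of look-ahead (and look-back) needed


def decode_at(tokens: List[int], i: int) -> int:
    """Toy 'decoder': weighted sum over the window centred on i, zero-padded."""
    total = 0
    for k, w in enumerate(WEIGHTS):
        j = i + k - RADIUS
        if 0 <= j < len(tokens):
            total += w * tokens[j]
    return total


def decode_offline(tokens: List[int]) -> List[int]:
    """Decode only after every token is available."""
    return [decode_at(tokens, i) for i in range(len(tokens))]


def decode_streaming(token_stream: Iterable[int]) -> Iterator[int]:
    """Emit each output as soon as its whole window has arrived."""
    seen: List[int] = []
    emitted = 0
    for tok in token_stream:
        seen.append(tok)
        # Output i is final once tokens up to i + RADIUS are known.
        while emitted + RADIUS < len(seen):
            yield decode_at(seen, emitted)
            emitted += 1
    # End of stream: the remaining windows run past the end (zero padding).
    while emitted < len(seen):
        yield decode_at(seen, emitted)
        emitted += 1


if __name__ == "__main__":
    tokens = [3, 1, 4, 1, 5, 9, 2, 6]
    assert list(decode_streaming(iter(tokens))) == decode_offline(tokens)
    print("streaming output matches offline output")
```

In this toy, the first output is available after RADIUS + 1 tokens rather than after the whole sequence, which is the property the latency claim relies on.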

Quick Start & Requirements

  • Install: pip install soprano-tts
  • Prerequisites: Linux or Windows with a CUDA-enabled GPU. PyTorch 2.8.0 built for CUDA 12.6 is required (pip uninstall -y torch && pip install torch==2.8.0 --index-url https://download.pytorch.org/whl/cu126).
  • Links: HuggingFace Model (https://huggingface.co/ekwek/Soprano-80M), HuggingFace Demo (https://huggingface.co/spaces/ekwek/Soprano-TTS).
  • Setup: Installation is straightforward via pip, but requires specific PyTorch and CUDA versions.
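
Since the setup depends on an exact PyTorch and CUDA pairing, a quick check before installing the package can save time. The sketch below is an assumption-laden helper, not part of Soprano: check_environment is a hypothetical function that compares the values reported by PyTorch (torch.__version__, torch.version.cuda, torch.cuda.is_available()) against the versions listed above.

```python
from typing import List, Optional

REQUIRED_TORCH = "2.8.0"  # version stated in the prerequisites
REQUIRED_CUDA = "12.6"    # CUDA backend stated in the prerequisites


def check_environment(torch_version: str,
                      cuda_version: Optional[str],
                      cuda_available: bool) -> List[str]:
    """Return a list of mismatches against the stated prerequisites."""
    problems = []
    # torch.__version__ may carry a local suffix such as "+cu126".
    if torch_version.split("+")[0] != REQUIRED_TORCH:
        problems.append(f"PyTorch {torch_version} found, {REQUIRED_TORCH} expected")
    if cuda_version != REQUIRED_CUDA:
        problems.append(f"CUDA build {cuda_version} found, {REQUIRED_CUDA} expected")
    if not cuda_available:
        problems.append("no CUDA-enabled GPU is visible to PyTorch")
    return problems


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
    else:
        issues = check_environment(torch.__version__,
                                   torch.version.cuda,
                                   torch.cuda.is_available())
        print("\n".join(issues) or "Environment matches the stated prerequisites")
```

An empty result means the installed PyTorch build matches the listed versions and can see a GPU; it does not verify anything about Soprano itself.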

Highlighted Details

  • Synthesizes high-fidelity 32 kHz audio.
  • Achieves ~2000x real-time factor (RTF) with <15 ms latency.
  • Operates with under 1 GB VRAM.
  • Utilizes a state-of-the-art neural audio codec for efficient representation.
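
The codec figures quoted under How It Works (~15 tokens/sec at 0.2 kbps) can be related to the 32 kHz output with some simple arithmetic. The sketch below is illustrative only: the comparison baseline of 16-bit mono PCM is an assumption made here, not something stated by the project, and codec_figures is a hypothetical helper.

```python
SAMPLE_RATE_HZ = 32_000   # output sample rate from the highlights
TOKENS_PER_SEC = 15       # approximate codec token rate
BITRATE_BPS = 200         # 0.2 kbps expressed in bits per second
PCM_BITS_PER_SAMPLE = 16  # assumption: uncompressed 16-bit mono PCM baseline


def codec_figures() -> dict:
    """Derive per-token quantities from the stated rates."""
    return {
        # Information carried by each token.
        "bits_per_token": BITRATE_BPS / TOKENS_PER_SEC,
        # Output samples the decoder must produce per token.
        "samples_per_token": SAMPLE_RATE_HZ / TOKENS_PER_SEC,
        # Size reduction relative to the assumed PCM baseline.
        "compression_vs_pcm": SAMPLE_RATE_HZ * PCM_BITS_PER_SAMPLE / BITRATE_BPS,
    }


if __name__ == "__main__":
    for name, value in codec_figures().items():
        print(f"{name}: {value:.1f}")
```

Under these assumptions each token carries about 13.3 bits and expands to roughly 2,133 output samples, a 2560x reduction relative to 16-bit mono PCM at 32 kHz.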

Maintenance & Community

The project appears to be a personal or academic endeavor by a second-year undergraduate, indicating potential for future development but possibly limited immediate community support or established maintenance processes. No specific community channels (Discord, Slack) or roadmap links are provided beyond the GitHub repository.

Licensing & Compatibility

Licensed under the Apache-2.0 license. This license is permissive and generally compatible with commercial use and closed-source applications.

Limitations & Caveats

The model was pretrained on a relatively small dataset (1,000 hours), and quality is expected to improve with more data. It is optimized purely for speed and currently lacks voice cloning, style control, and multilingual support. CPU support is listed as "coming soon."

Health Check

  • Last Commit: 17 hours ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 15
  • Issues (30d): 17
  • Star History: 756 stars in the last 30 days
