soprano by ekwek1

Ultra-fast, high-fidelity text-to-speech model

Created 7 months ago

1,248 stars

Top 30.9% on SourcePulse


1 Expert Loves This Project

Luis Capelo

Cofounder of Lightning AI

Project Summary

Soprano is an ultra-lightweight, open-source text-to-speech (TTS) model designed for real-time, high-fidelity speech synthesis. It targets developers and users who need a compact, fast, easily deployable TTS solution: the 80M-parameter model runs in under 1 GB of VRAM. Its main benefit is speed. It generates audio roughly 2000x faster than real time (the figure the project reports as its real-time factor, RTF), so output is available almost instantly.
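To make the speed claim concrete, the arithmetic below converts a real-time speedup into wall-clock synthesis time. It is a minimal sketch that takes the ~2000x figure at face value; the function name is hypothetical, and actual throughput will depend on hardware and batch size.

```python
def synthesis_seconds(audio_seconds: float, speedup: float = 2000.0) -> float:
    """Wall-clock seconds needed to generate `audio_seconds` of speech
    when the model runs `speedup` times faster than real time."""
    return audio_seconds / speedup


# Claimed figure from the summary (~2000x); treated here as an assumption.
one_hour = synthesis_seconds(3600.0)
ten_seconds = synthesis_seconds(10.0)

print(f"1 hour of audio: about {one_hour:.1f} s of compute")
print(f"10 s of audio:   about {ten_seconds * 1000:.1f} ms of compute")
```

At that rate an hour of speech would take under two seconds to generate, which is why the summary describes generation as near-instantaneous.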

How It Works

Soprano uses a neural decoder based on the Vocos vocoder architecture, which generates waveforms much faster than diffusion-based decoders while maintaining perceptual quality. Speech is represented with a neural audio codec that compresses audio to ~15 tokens/sec at 0.2 kbps, which keeps generation fast and memory use low. A key feature is seamless streaming: because the decoder has a finite receptive field, synthesis can start after only a few audio tokens have been generated, giving ultra-low latency (<15 ms) while producing output that is acoustically identical to offline synthesis.
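The streaming idea can be illustrated with a toy decoder. The sketch below is not Soprano's or Vocos's code: it substitutes a small FIR filter (a single 1-D convolution) for the neural decoder, and all function names are hypothetical. It shows why a finite receptive field allows output to be emitted as soon as the needed context has arrived, and that the streamed result matches offline decoding.

```python
import numpy as np


def decode_offline(tokens, kernel):
    """Decode the whole sequence at once (zero padding on both sides)."""
    r = len(kernel) // 2                      # receptive-field radius
    return np.convolve(np.pad(tokens, r), kernel, mode="valid")


def decode_streaming(token_stream, kernel, chunk=4):
    """Yield output chunks as soon as their full context has arrived."""
    r = len(kernel) // 2
    padded = [0.0] * r                        # same left padding as offline
    emitted = 0                               # outputs already yielded
    n = 0                                     # tokens received so far
    for tok in token_stream:
        padded.append(float(tok))
        n += 1
        ready = n - r                         # outputs with complete context
        if ready - emitted >= chunk:
            window = np.asarray(padded[emitted:ready + 2 * r])
            yield np.convolve(window, kernel, mode="valid")
            emitted = ready
    padded.extend([0.0] * r)                  # right padding at end of stream
    if n > emitted:                           # flush the remaining outputs
        window = np.asarray(padded[emitted:n + 2 * r])
        yield np.convolve(window, kernel, mode="valid")


kernel = np.array([0.25, 0.5, 1.0, 0.5, 0.25])   # toy decoder, radius 2
tokens = np.arange(20, dtype=float)              # stand-in for audio tokens

offline = decode_offline(tokens, kernel)
chunks = list(decode_streaming(iter(tokens), kernel))
streamed = np.concatenate(chunks)

print("streaming matches offline:", np.allclose(offline, streamed))
```

In this toy setup the first chunk is produced after `chunk` plus the radius tokens have arrived, rather than after the full sequence; a real decoder with a larger receptive field would follow the same pattern with a longer initial wait.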

Quick Start & Requirements

Install: pip install soprano-tts
Prerequisites: Linux or Windows with a CUDA-enabled GPU. PyTorch 2.8.0 built for CUDA 12.6 is required; to reinstall it, run: pip uninstall -y torch && pip install torch==2.8.0 --index-url https://download.pytorch.org/whl/cu126
Links: HuggingFace Model (https://huggingface.co/ekwek/Soprano-80M), HuggingFace Demo (https://huggingface.co/spaces/ekwek/Soprano-TTS).
Setup: Installation is straightforward via pip, but requires specific PyTorch and CUDA versions.
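Because the setup depends on specific PyTorch and CUDA versions, a quick environment check before installing can save some debugging. The sketch below is an illustration, not part of the project; `find_problems` is a hypothetical helper, and the expected versions are taken from the prerequisites above.

```python
def find_problems(torch_version, cuda_version, cuda_available):
    """Compare an environment against the versions listed in the prerequisites."""
    problems = []
    # torch.__version__ usually looks like "2.8.0+cu126"; drop the build suffix.
    if str(torch_version).split("+")[0] != "2.8.0":
        problems.append(f"expected PyTorch 2.8.0, found {torch_version}")
    # torch.version.cuda is None on CPU-only builds.
    if cuda_version != "12.6":
        problems.append(f"expected a CUDA 12.6 build, found {cuda_version}")
    if not cuda_available:
        problems.append("no CUDA-enabled GPU is visible to PyTorch")
    return problems


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
    else:
        issues = find_problems(
            torch.__version__, torch.version.cuda, torch.cuda.is_available()
        )
        for line in issues or ["environment matches the listed prerequisites"]:
            print(line)
```

If the check reports a mismatch, the reinstall command in the prerequisites is the documented fix.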

Highlighted Details

Synthesizes high-fidelity 32 kHz audio.
Generates audio ~2000x faster than real time, with <15 ms streaming latency.
Operates with under 1 GB VRAM.
Uses a neural audio codec (~15 tokens/sec at 0.2 kbps) for a compact speech representation.
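The codec figures quoted in this summary can be cross-checked with a little arithmetic. The sketch below assumes 1 kbps = 1000 bits/sec and uses 16-bit mono PCM as the uncompressed baseline; both are assumptions made for illustration, and the inputs are the approximate values from the summary.

```python
SAMPLE_RATE_HZ = 32_000      # output audio sample rate
TOKENS_PER_SEC = 15          # approximate codec token rate
CODEC_BITS_PER_SEC = 200     # 0.2 kbps, assuming 1 kbps = 1000 bits/sec
PCM_BITS_PER_SAMPLE = 16     # assumed baseline: 16-bit mono PCM

bits_per_token = CODEC_BITS_PER_SEC / TOKENS_PER_SEC
samples_per_token = SAMPLE_RATE_HZ / TOKENS_PER_SEC
pcm_bits_per_sec = SAMPLE_RATE_HZ * PCM_BITS_PER_SAMPLE
compression_ratio = pcm_bits_per_sec / CODEC_BITS_PER_SEC

print(f"bits per token:       {bits_per_token:.1f}")
print(f"samples per token:    {samples_per_token:.0f}")
print(f"compression vs. PCM:  {compression_ratio:.0f}x")
```

Under these assumptions each token carries about 13 bits and stands for roughly 2,100 output samples, which is why so few decoding steps are needed per second of audio.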

Maintenance & Community

The project appears to be a personal or academic effort by a second-year undergraduate. That leaves room for future development, but community support may be limited and there are no established maintenance processes. No community channels (Discord, Slack) or roadmap links are provided beyond the GitHub repository.

Licensing & Compatibility

Licensed under the Apache-2.0 license. This license is permissive and generally compatible with commercial use and closed-source applications.

Limitations & Caveats

The model was pretrained on a relatively small dataset (1000 hours), with quality expected to improve with more data. It is optimized purely for speed and currently lacks features such as voice cloning, style control, and multilingual support. CPU support is listed as "coming soon."

Health Check

Last Commit: 5 months ago
Responsiveness: Inactive

Star History

14 stars in the last 30 days