VoiceStar by jasonppy

Robust, duration-controllable TTS that extrapolates

Created 4 months ago
278 stars

Top 93.3% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

VoiceStar is a robust, duration-controllable Text-to-Speech (TTS) system that can extrapolate beyond the durations seen during training. It targets researchers and developers who need high-quality voice synthesis with precise control over output length.

How It Works

VoiceStar employs a novel approach that combines a duration predictor with a robust acoustic model, enabling precise control over speech duration. This architecture allows the model to generalize and extrapolate to unseen durations, a common challenge in TTS systems. The use of EnCodec for audio representation contributes to high-fidelity synthesis.
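As a back-of-the-envelope illustration (not the project's actual API), duration control can be thought of as fixing the number of codec frames the model must emit: EnCodec at its 24 kHz configuration produces 75 frames per second, so a target duration maps directly to a frame budget. Whether VoiceStar uses this exact EnCodec configuration is an assumption, and frames_for_duration is a hypothetical helper, not a function from the repository.

```python
# Hypothetical sketch: mapping a target duration to a codec-frame budget.
# 75 frames/s is EnCodec's rate at 24 kHz; that VoiceStar uses this exact
# configuration is an assumption.
ENCODEC_FRAME_RATE = 75  # frames per second (assumed)

def frames_for_duration(target_seconds: float,
                        frame_rate: int = ENCODEC_FRAME_RATE) -> int:
    """Number of codec frames a model would need to emit for this duration."""
    if target_seconds <= 0:
        raise ValueError("duration must be positive")
    return round(target_seconds * frame_rate)
```

Under this view, a 4-second target corresponds to a budget of 300 frames, and extrapolation means generating coherent speech for frame budgets longer than any seen in training.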

Quick Start & Requirements

  • Install: Create a conda environment (conda create -n voicestar python=3.10) and activate it (conda activate voicestar). Install PyTorch from the CUDA 12.4 index (pip install torch==2.5.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124), then the remaining dependencies from PyPI (pip install numpy tqdm fire phonemizer==3.2.1 torchmetrics einops omegaconf==2.3.0 openai-whisper gradio). Install espeak-ng via apt-get.
  • Prerequisites: Python 3.10, CUDA 12.4 (for PyTorch), espeak-ng.
  • Model Download: Download EnCodec and VoiceStar model weights from Hugging Face.
  • Inference: Run python inference_commandline.py --reference_speech <path_to_wav> --target_text "<your_text>" --target_duration <seconds>.
  • Demo: A Gradio interface is available via python inference_gradio.py.
  • Setup Time: Estimated setup time is approximately 15-30 minutes, depending on download speeds and environment setup.
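The steps above, collected into one session. Package versions and script names are as given above; values in angle brackets are placeholders to fill in, and the split of the pip install into two invocations (PyTorch index first, PyPI second) is this summary's reading of the setup, not a verbatim copy of the README.

```shell
# Environment setup
conda create -n voicestar python=3.10
conda activate voicestar

# PyTorch with CUDA 12.4 wheels, then the remaining dependencies
pip install torch==2.5.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
pip install numpy tqdm fire phonemizer==3.2.1 torchmetrics einops omegaconf==2.3.0 openai-whisper gradio

# System backend required by phonemizer
sudo apt-get install espeak-ng

# Command-line inference (EnCodec and VoiceStar weights must be
# downloaded from Hugging Face beforehand)
python inference_commandline.py \
  --reference_speech <path_to_wav> \
  --target_text "<your_text>" \
  --target_duration <seconds>
```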

Highlighted Details

  • Achieves duration-controllable TTS with extrapolation capabilities.
  • Utilizes EnCodec for high-fidelity audio representation.
  • Provides command-line and Gradio interfaces for inference.
  • Supports fine-tuning and training with additional packages like datasets, tensorboard, and wandb.

Maintenance & Community

The project is maintained by jasonppy. Links to community channels or roadmaps are not provided in the README.

Licensing & Compatibility

  • Code License: MIT
  • Model Weights License: CC-BY-4.0 (due to the Emilia dataset).
  • Compatibility: CC-BY-4.0 permits commercial use and linking in closed-source projects, but attribution is required wherever the model weights are used; factor this into any deployment.

Limitations & Caveats

The README notes that phonemizer's words_mismatch.py may emit a warning, which can be suppressed by modifying its source code. The model weights are licensed under CC-BY-4.0, so attribution is required even in commercial products.

Health Check
Last Commit

3 months ago

Responsiveness

Inactive

Pull Requests (30d)
1
Issues (30d)
0
Star History
8 stars in the last 30 days

Explore Similar Projects

Starred by Georgios Konstantopoulos (CTO, General Partner at Paradigm) and Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems").

GPT-SoVITS by RVC-Boss

Top 0.4% on SourcePulse
51k stars
Few-shot voice cloning and TTS web UI
Created 1 year ago
Updated 1 month ago