VoiceStar by jasonppy

Robust, duration-controllable TTS that extrapolates

Created 4 months ago
278 stars

Top 93.3% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

VoiceStar is a robust, duration-controllable Text-to-Speech (TTS) system that can extrapolate beyond the durations seen during training. It targets researchers and developers who need high-quality voice synthesis with precise control over output length.

How It Works

VoiceStar employs a novel approach that combines a duration predictor with a robust acoustic model, enabling precise control over speech duration. This architecture allows the model to generalize and extrapolate to unseen durations, a common challenge in TTS systems. The use of EnCodec for audio representation contributes to high-fidelity synthesis.
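As a back-of-the-envelope illustration (not the project's actual API), duration control can be thought of as fixing the number of codec frames the model must emit: EnCodec at its 24 kHz configuration produces 75 frames per second, so a target duration maps directly to a frame budget. Whether VoiceStar uses this exact EnCodec configuration is an assumption, and frames_for_duration is a hypothetical helper, not a function from the repository.

```python
# Hypothetical sketch: mapping a target duration to a codec-frame budget.
# 75 frames/s is EnCodec's rate at 24 kHz; that VoiceStar uses this exact
# configuration is an assumption.
ENCODEC_FRAME_RATE = 75  # frames per second (assumed)

def frames_for_duration(target_seconds: float,
                        frame_rate: int = ENCODEC_FRAME_RATE) -> int:
    """Number of codec frames a model would need to emit for this duration."""
    if target_seconds <= 0:
        raise ValueError("duration must be positive")
    return round(target_seconds * frame_rate)
```

Under this view, a 4-second target corresponds to a budget of 300 frames, and extrapolation means generating coherent speech for frame budgets longer than any seen in training.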

Quick Start & Requirements

  • Install: Create a conda environment (conda create -n voicestar python=3.10) and activate it (conda activate voicestar). Install PyTorch from the CUDA 12.4 index (pip install torch==2.5.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124), then the remaining dependencies from PyPI (pip install numpy tqdm fire phonemizer==3.2.1 torchmetrics einops omegaconf==2.3.0 openai-whisper gradio). Install espeak-ng via apt-get.
  • Prerequisites: Python 3.10, CUDA 12.4 (for PyTorch), espeak-ng.
  • Model Download: Download EnCodec and VoiceStar model weights from Hugging Face.
  • Inference: Run python inference_commandline.py --reference_speech <path_to_wav> --target_text "<your_text>" --target_duration <seconds>.
  • Demo: A Gradio interface is available via python inference_gradio.py.
  • Setup Time: Estimated setup time is approximately 15-30 minutes, depending on download speeds and environment setup.
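The steps above, collected into one session. Package versions and script names are as given above; values in angle brackets are placeholders to fill in, and the split of the pip install into two invocations (PyTorch index first, PyPI second) is this summary's reading of the setup, not a verbatim copy of the README.

```shell
# Environment setup
conda create -n voicestar python=3.10
conda activate voicestar

# PyTorch with CUDA 12.4 wheels, then the remaining dependencies
pip install torch==2.5.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
pip install numpy tqdm fire phonemizer==3.2.1 torchmetrics einops omegaconf==2.3.0 openai-whisper gradio

# System backend required by phonemizer
sudo apt-get install espeak-ng

# Command-line inference (EnCodec and VoiceStar weights must be
# downloaded from Hugging Face beforehand)
python inference_commandline.py \
  --reference_speech <path_to_wav> \
  --target_text "<your_text>" \
  --target_duration <seconds>
```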

Highlighted Details

  • Achieves duration-controllable TTS with extrapolation capabilities.
  • Utilizes EnCodec for high-fidelity audio representation.
  • Provides command-line and Gradio interfaces for inference.
  • Supports fine-tuning and training with additional packages like datasets, tensorboard, and wandb.

Maintenance & Community

The project is maintained by jasonppy. Links to community channels or roadmaps are not provided in the README.

Licensing & Compatibility

  • Code License: MIT
  • Model Weights License: CC-BY-4.0 (due to the Emilia dataset).
  • Compatibility: CC-BY-4.0 permits commercial use and linking in closed-source projects, but attribution is required wherever the model weights are used; factor this into any deployment.

Limitations & Caveats

The README notes that phonemizer's words_mismatch.py may emit a warning, which can be suppressed by modifying its source code. The model weights are licensed under CC-BY-4.0, so attribution is required even in commercial products.

Health Check
Last Commit

3 months ago

Responsiveness

Inactive

Pull Requests (30d)
1
Issues (30d)
0
Star History
8 stars in the last 30 days

Explore Similar Projects

Starred by Georgios Konstantopoulos (CTO, General Partner at Paradigm) and Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems").

GPT-SoVITS by RVC-Boss

Top 0.4% on SourcePulse
51k stars
Few-shot voice cloning and TTS web UI
Created 1 year ago
Updated 1 month ago