Robust, duration-controllable TTS that extrapolates
Top 93.3% on SourcePulse
VoiceStar is a robust, duration-controllable Text-to-Speech (TTS) system for high-quality voice synthesis that can extrapolate beyond the durations seen in training. It targets researchers and developers who need precise control over generated speech length without sacrificing robustness or quality.
How It Works
VoiceStar employs a novel approach that combines a duration predictor with a robust acoustic model, enabling precise control over speech duration. This architecture allows the model to generalize and extrapolate to unseen durations, a common challenge in TTS systems. The use of EnCodec for audio representation contributes to high-fidelity synthesis.
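The core idea of duration control in a codec-based TTS system can be illustrated with a small sketch. This is not the actual VoiceStar API; the function name and the frame rate are assumptions for illustration only. The point is that once audio is represented as discrete codec frames at a fixed rate, a requested duration becomes an exact frame budget for the generator, which is what lets such a model target lengths unseen in training.

```python
# Hypothetical sketch (not the VoiceStar implementation): with a neural codec
# like EnCodec, audio is a sequence of frames at a fixed rate, so duration
# control reduces to fixing how many frames the acoustic model must generate.

def target_frames(duration_s: float, frame_rate_hz: float = 75.0) -> int:
    """Map a requested duration in seconds to a codec frame count.

    frame_rate_hz is an assumed value; EnCodec configurations commonly
    run in the 50-75 frames-per-second range.
    """
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    return round(duration_s * frame_rate_hz)

# The generator is then conditioned to emit exactly this many frames,
# rather than deciding the length implicitly from the text alone.
```

Under these assumptions, a 2-second request at 75 frames/s becomes a budget of 150 frames.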
Quick Start & Requirements
Create a conda environment (conda create -n voicestar python=3.10), activate it (conda activate voicestar), and install dependencies:
pip install torch==2.5.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124 numpy tqdm fire phonemizer==3.2.1 torchmetrics einops omegaconf==2.3.0 openai-whisper gradio
Install espeak-ng via apt-get (it is the backend required by phonemizer). Run command-line inference with:
python inference_commandline.py --reference_speech <path_to_wav> --target_text "<your_text>" --target_duration <seconds>
or launch the Gradio demo with python inference_gradio.py.
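The command-line entry point above can also be driven from Python. A minimal sketch, assuming only the script name and flags shown in the Quick Start; the helper name and wrapper logic are hypothetical:

```python
# Sketch: build the inference command programmatically. Only the script name
# and flags from the Quick Start are taken from the project; everything else
# here is an assumption for illustration.
import shlex
import subprocess


def build_inference_cmd(reference_wav: str, text: str, duration_s: float) -> list:
    """Assemble the argv list for inference_commandline.py."""
    return [
        "python", "inference_commandline.py",
        "--reference_speech", reference_wav,
        "--target_text", text,
        "--target_duration", str(duration_s),
    ]


cmd = build_inference_cmd("ref.wav", "hello world", 3.0)
print(shlex.join(cmd))  # inspect the command before running it
# To actually run it inside the repo: subprocess.run(cmd, check=True)
```

Passing arguments as a list (rather than a shell string) avoids quoting problems when the target text contains spaces or shell metacharacters.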
Highlighted Details
Additional packages referenced beyond the base install: datasets, tensorboard, and wandb.
Maintenance & Community
The project is maintained by jasonppy. Links to community channels or roadmaps are not provided in the README.
Licensing & Compatibility
Limitations & Caveats
The README notes a potential warning from phonemizer's words_mismatch.py, which can be bypassed by modifying the source code. The model weights are licensed under CC-BY-4.0, which permits commercial use but requires attribution.
Last updated 3 months ago; currently marked inactive.