Style-based generative model for natural, diverse text-to-speech synthesis
StyleTTS is an official implementation of a style-based generative model for text-to-speech (TTS) synthesis. It addresses the challenge of producing speech with natural prosodic variation, diverse speaking styles, and emotional tones, which parallel TTS systems often struggle to reproduce. It is aimed at researchers and developers building advanced TTS systems, and the benefit is the ability to synthesize diverse, natural-sounding speech whose style is controlled by a reference utterance.
How It Works
StyleTTS employs a novel Transferable Monotonic Aligner (TMA) and duration-invariant data augmentation. The TMA ensures monotonic alignments crucial for natural speech, while the data augmentation schemes improve robustness. By learning speaking styles through self-supervision, the model can replicate the prosody and emotional tone of a reference speech utterance without explicit style labels, enabling zero-shot style transfer.
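As a mental model of how this workflow would look at inference time, here is a minimal sketch of zero-shot style transfer. All names in it (load_styletts, compute_style, inference) are hypothetical placeholders, not the repo's actual API; consult the repository's demo code for the real entry points.

```python
import torch

# Hypothetical sketch only: load_styletts, compute_style, and inference are
# placeholder names illustrating the workflow, not the repo's actual API.
model = load_styletts("Models/checkpoint.pth")  # assumed checkpoint path

with torch.no_grad():
    # Encode a reference utterance into a fixed style vector capturing
    # its prosody and emotional tone (learned via self-supervision).
    style = model.compute_style("reference.wav")

    # Condition synthesis on that vector; no explicit style labels needed.
    wav = model.inference("Text to synthesize in the reference style.", style)
```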
Quick Start & Requirements
```
pip install SoundFile torchaudio munch pydub pyyaml librosa git+https://github.com/resemble-ai/monotonic_align.git
```

The phonemizer package is also required for the text frontend; its espeak backend additionally needs espeak installed on the system.
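Since phonemizer drives the text frontend, a quick smoke test of the espeak backend is a reasonable sanity check after installation. The flags below are plausible defaults, not StyleTTS's confirmed settings; the authoritative configuration is in the repo's preprocessing code.

```python
from phonemizer.backend import EspeakBackend

# Sanity-check the phonemizer/espeak installation. The language and stress
# flags here are common defaults, not StyleTTS's confirmed configuration.
backend = EspeakBackend(language="en-us", preserve_punctuation=True, with_stress=True)
print(backend.phonemize(["How are you today?"]))
# e.g. ['haʊ ɑːɹ juː tədˈeɪ?']
```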
Highlighted Details
Maintenance & Community
The last commit was about 6 months ago, and the repository is marked as inactive.
Licensing & Compatibility
The dependency on monotonic_align from resemble-ai may imply obligations under that package's license. Users should verify licensing terms before commercial use.
Limitations & Caveats
The README indicates that the provided pre-trained models are tied to a specific preprocessing pipeline (mel-spectrograms computed by meldataset.py); using custom preprocessing requires retraining the text aligner and pitch extractor. The author notes that more recipes and pre-trained models may be provided "later if I have time," so the project may still change.
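To illustrate why this preprocessing coupling matters, the sketch below computes a mel-spectrogram with torchaudio. The parameter values are assumptions chosen for illustration, not meldataset.py's confirmed settings; any mismatch with the training-time features invalidates the pre-trained aligner and pitch extractor.

```python
import torchaudio

# Illustrative mel-spectrogram extraction; these parameter values are
# assumptions, not meldataset.py's confirmed settings.
to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=24000, n_fft=2048, win_length=1200, hop_length=300, n_mels=80
)

wav, sr = torchaudio.load("reference.wav")  # resample first if sr != 24000
mel = to_mel(wav)
# A model trained on one mel configuration will not line up with features
# produced by another, hence the need to retrain the aligner and pitch extractor.
```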