StyleTTS by yl4579

Style-based generative model for natural, diverse text-to-speech synthesis

created 3 years ago
439 stars

Top 69.1% on sourcepulse

View on GitHub
Project Summary

StyleTTS is the official implementation of a style-based generative model for text-to-speech (TTS) synthesis. It addresses the challenge of producing speech with natural prosodic variation, speaking styles, and emotional tone, which parallel TTS systems often struggle to capture. It targets researchers and developers building advanced TTS systems, and its benefit is the ability to synthesize diverse, natural-sounding speech whose style is controlled by a reference utterance.

How It Works

StyleTTS employs a novel Transferable Monotonic Aligner (TMA) and duration-invariant data augmentation. The TMA ensures monotonic alignments crucial for natural speech, while the data augmentation schemes improve robustness. By learning speaking styles through self-supervision, the model can replicate the prosody and emotional tone of a reference speech utterance without explicit style labels, enabling zero-shot style transfer.
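The monotonic-alignment constraint at the heart of the TMA can be illustrated with a toy dynamic-programming sketch. This is only an illustration of the general idea (a best monotonic path through a text-to-frame similarity matrix), not the TMA implementation; all names here are hypothetical.

```python
import numpy as np

def monotonic_path(sim: np.ndarray) -> np.ndarray:
    """Find a best monotonic text-to-frame alignment through a
    (n_tokens, n_frames) similarity matrix via dynamic programming.
    At each frame the path either stays on the current token or
    advances to the next one; it never moves backwards."""
    T, F = sim.shape
    dp = np.full((T, F), -np.inf)
    dp[0, 0] = sim[0, 0]
    for f in range(1, F):
        for t in range(min(T, f + 1)):  # token index can't exceed frame index
            stay = dp[t, f - 1]
            advance = dp[t - 1, f - 1] if t > 0 else -np.inf
            dp[t, f] = sim[t, f] + max(stay, advance)
    # Backtrack from the last token at the last frame.
    path = np.zeros(F, dtype=int)
    t = T - 1
    for f in range(F - 1, -1, -1):
        path[f] = t
        if f > 0 and t > 0 and dp[t - 1, f - 1] >= dp[t, f - 1]:
            t -= 1
    return path

# Toy example: 3 tokens, 6 frames, each token clearly matching two frames.
sim = np.array([[1, 1, 0, 0, 0, 0],
                [0, 0, 1, 1, 0, 0],
                [0, 0, 0, 0, 1, 1]], dtype=float)
print(monotonic_path(sim))  # [0 0 1 1 2 2]
```

In the real model, the similarity matrix would come from learned text and speech encoders, and the resulting alignment supplies token durations for synthesis.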

Quick Start & Requirements

  • Install via pip: pip install SoundFile torchaudio munch pydub pyyaml librosa git+https://github.com/resemble-ai/monotonic_align.git
  • Requires Python >= 3.7.
  • LJSpeech dataset (upsampled to 24 kHz) or LibriTTS (train-clean-360 combined with train-clean-100, renamed to train-clean-460) is needed.
  • Pre-trained models for StyleTTS and HiFi-GAN are available for download.
  • Inference requires installing phonemizer.
  • Official documentation and audio samples are available at https://styletts.github.io/.

Highlighted Details

  • Outperforms state-of-the-art models in subjective tests for speech naturalness and speaker similarity.
  • Achieves diverse speech synthesis with natural prosody from reference speech.
  • Enables zero-shot style transfer by learning speaking styles through self-supervised learning.
  • Novel Transferable Monotonic Aligner (TMA) for improved alignment.

Maintenance & Community

  • The project is maintained by yl4579.
  • No specific community channels (Discord/Slack) or roadmap are mentioned in the README.

Licensing & Compatibility

  • The repository does not explicitly state a license. The presence of monotonic_align from resemble-ai might imply a dependency on its license. Users should verify licensing for commercial use.

Limitations & Caveats

The README notes that the provided pre-trained models are tied to a specific preprocessing pipeline (mel-spectrograms computed by meldataset.py); using custom preprocessing requires retraining the text aligner and pitch extractor. The author mentions providing more recipes and pre-trained models "later if I have time," suggesting development is ongoing but best-effort.

Health Check

  • Last commit: 6 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 9 stars in the last 90 days

Explore Similar Projects

StyleTTS2 by yl4579 (Top 0.2%, 6k stars)
Text-to-speech model achieving human-level synthesis
created 2 years ago, updated 11 months ago
Starred by Tim J. Baek (Founder of Open WebUI), Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), and 3 more.