StyleSpeech by KevinMIN95

Multi-speaker adaptive TTS generation

Created 4 years ago
251 stars

Top 99.8% on SourcePulse

Project Summary

Meta-StyleSpeech is a text-to-speech (TTS) system designed for multi-speaker adaptive speech generation, enabling high-quality, personalized voice synthesis from minimal reference audio. It targets researchers and developers in speech synthesis seeking efficient speaker adaptation without extensive fine-tuning.

How It Works

The core innovation is Style-Adaptive Layer Normalization (SALN), which sets the gain and bias applied to the text-side hidden features according to a style vector extracted from a reference audio sample. Meta-StyleSpeech extends this with discriminators trained with style prototypes and with episodic training, so the model can adapt to a new speaker's voice from a very short (1-3 second) audio clip.
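
To make the SALN idea concrete, here is a minimal, dependency-free sketch. It is not the repository's implementation: the function names (`layer_norm`, `affine`, `saln`), the toy weights, and the choice of plain Python lists are all assumptions made for illustration. It shows only the general pattern of normalizing each frame and then applying a gain and bias predicted from a style vector.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalize one feature vector to (approximately) zero mean, unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def affine(style, weight, offset):
    """Map a style vector (length S) to a per-feature vector (length H).

    `weight` is a list of H rows, each of length S (hypothetical layout).
    """
    return [sum(w * s for w, s in zip(row, style)) + b
            for row, b in zip(weight, offset)]

def saln(hidden, style, gain_w, gain_b, bias_w, bias_b):
    """Apply a style-conditioned gain and bias to every normalized frame."""
    gain = affine(style, gain_w, gain_b)
    bias = affine(style, bias_w, bias_b)
    return [[g * x + b for g, x, b in zip(gain, layer_norm(frame), bias)]
            for frame in hidden]

# Toy example: 2 frames, 3 features, style vector of length 2.
hidden = [[1.0, 2.0, 3.0], [4.0, 0.0, -4.0]]
style = [0.5, -1.0]
gain_w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
gain_b = [1.0, 1.0, 1.0]
bias_w = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
bias_b = [0.0, 0.0, 0.0]

out = saln(hidden, style, gain_w, gain_b, bias_w, bias_b)
print(out)
```

With an all-zero style vector the predicted gain and bias fall back to `gain_b` and `bias_b`, so the layer behaves like ordinary layer normalization; a different style vector shifts and rescales the same normalized features.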

Quick Start & Requirements

  • Install: Clone the repository and install Python requirements from requirements.txt.
  • Prerequisites: Requires pre-trained models (links provided in README) and a reference speech audio sample. Montreal Forced Aligner (MFA) is used for dataset preprocessing.
  • Inference: python synthesize.py --text <raw text> --ref_audio <path to reference audio> --checkpoint_path <path to pretrained model>
  • Dataset: Trained on LibriTTS. Preprocessing involves resampling, forced alignment with MFA, and generating mel-spectrograms, durations, pitch, and energy.
  • Links: [Demo audio samples](demo page)
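
The inference command above can also be assembled programmatically, for example when synthesizing many sentences in a loop. The sketch below is an assumption-laden illustration: the helper name `build_synthesize_command` and the example paths are hypothetical, and only the script name and the three flags come from the command shown above.

```python
import shlex

def build_synthesize_command(text, ref_audio, checkpoint_path, python="python"):
    """Return the documented synthesize.py invocation as an argument list."""
    return [
        python, "synthesize.py",
        "--text", text,
        "--ref_audio", ref_audio,
        "--checkpoint_path", checkpoint_path,
    ]

# Hypothetical placeholder paths; substitute your own reference clip and checkpoint.
cmd = build_synthesize_command(
    text="Hello there.",
    ref_audio="ref/sample.wav",
    checkpoint_path="checkpoints/model.pth.tar",
)

# Print a shell-safe rendering of the command for inspection.
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` from the repository root would run the synthesis; using an argument list rather than a single string avoids quoting problems when the text contains spaces or punctuation.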

Highlighted Details

  • Synthesizes high-quality speech that closely follows the target speaker's voice.
  • Adapts to a new speaker from a single short (1-3 second) speech sample.
  • Reported to outperform baseline methods in speaker adaptation quality.
  • Offers official implementations for both StyleSpeech and Meta-StyleSpeech.
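
The single-sample adaptation highlighted above pairs naturally with the episodic training mentioned under How It Works. As a rough illustration only, the sketch below shows one generic way to form a one-shot episode: pick a speaker, take one utterance as the reference (support) sample, and take the text of another utterance as the query to synthesize. The data layout, the field names, and `sample_episode` are hypothetical and are not taken from the repository.

```python
import random

def sample_episode(utterances_by_speaker, rng):
    """Pick one speaker, then one support utterance and one distinct query text."""
    speaker = rng.choice(sorted(utterances_by_speaker))
    # Two different utterances: one acts as the reference, the other supplies the text.
    support, query = rng.sample(utterances_by_speaker[speaker], 2)
    return {
        "speaker": speaker,
        "support_audio": support["audio"],
        "support_text": support["text"],
        "query_text": query["text"],
    }

# Hypothetical in-memory dataset: speaker id -> list of utterances.
dataset = {
    "spk_a": [
        {"audio": "spk_a/0001.wav", "text": "first sentence"},
        {"audio": "spk_a/0002.wav", "text": "second sentence"},
    ],
    "spk_b": [
        {"audio": "spk_b/0001.wav", "text": "third sentence"},
        {"audio": "spk_b/0002.wav", "text": "fourth sentence"},
    ],
}

episode = sample_episode(dataset, random.Random(0))
print(episode)
```

Training on many such episodes mimics the test-time situation, where the model sees only one short clip from an unseen speaker and must speak new text in that voice.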

Maintenance & Community

The project was last updated in December 2021. No community links (Discord, Slack) or roadmap are provided in the README.

Licensing & Compatibility

The README does not explicitly state a license. It references other projects, implying potential licensing considerations from those dependencies. Commercial use compatibility is not specified.

Limitations & Caveats

With no updates since late 2021, open issues may go unaddressed and ongoing development appears unlikely. The README gives no explicit information on compatibility with newer Python versions or hardware accelerators; any such support would have to be inferred from the listed dependencies.

Health Check

  • Last commit: 3 years ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 1 star in the last 30 days
