GenerSpeech by Rongjiehuang

Text-to-speech model for zero-shot style transfer of custom voice

created 2 years ago
327 stars

Top 84.6% on sourcepulse

Project Summary

GenerSpeech is a PyTorch implementation of a text-to-speech (TTS) model designed for zero-shot style transfer of out-of-domain custom voices. It targets researchers and developers working on expressive and customizable speech synthesis, enabling high-fidelity audio generation with novel voice characteristics.

How It Works

GenerSpeech uses multi-level style transfer to generalize to unseen voice styles. It combines an acoustic model, a neural vocoder (HiFi-GAN), and an emotion encoder to produce expressive, high-fidelity speech. Zero-shot style transfer works by conditioning speech generation on a reference audio sample.
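The data flow described above can be sketched in a few lines of plain Python. This is only an illustration of how the three components might fit together, not the project's code: every function and class name here (encode_style, acoustic_model, vocoder, synthesize, StyleEmbedding) is a hypothetical placeholder, and the arithmetic inside each stub merely stands in for a trained neural network.

```python
from dataclasses import dataclass
from typing import List

# NOTE: all names below are hypothetical placeholders, not the GenerSpeech API.
# Each stub stands in for a trained network and only mimics its input/output shape.


@dataclass
class StyleEmbedding:
    vector: List[float]


def encode_style(reference_audio: List[float], dim: int = 4) -> StyleEmbedding:
    """Stand-in for the style/emotion encoder: reference audio -> fixed-size vector."""
    if not reference_audio:
        raise ValueError("reference_audio must not be empty")
    chunk = max(1, len(reference_audio) // dim)
    vector = []
    for i in range(dim):
        # Average each slice of the reference; pad with 0.0 if the audio is too short.
        segment = reference_audio[i * chunk:(i + 1) * chunk] or [0.0]
        vector.append(sum(segment) / len(segment))
    return StyleEmbedding(vector)


def acoustic_model(text: str, style: StyleEmbedding,
                   frames_per_char: int = 2) -> List[List[float]]:
    """Stand-in for the acoustic model: text + style -> mel-like frames."""
    frames = []
    for ch in text:
        base = (ord(ch) % 32) / 32.0  # toy "content" value per character
        for _ in range(frames_per_char):
            # Conditioning: every frame is shifted by the style vector.
            frames.append([base + s for s in style.vector])
    return frames


def vocoder(mel: List[List[float]], hop: int = 3) -> List[float]:
    """Stand-in for the neural vocoder: frames -> waveform samples."""
    wave = []
    for frame in mel:
        level = sum(frame) / len(frame)
        wave.extend([level] * hop)  # each frame expands to `hop` samples
    return wave


def synthesize(text: str, reference_audio: List[float]) -> List[float]:
    """Zero-shot flow: the reference is used only at inference time, with no retraining."""
    style = encode_style(reference_audio)
    mel = acoustic_model(text, style)
    return vocoder(mel)


if __name__ == "__main__":
    calm = synthesize("hi", [0.1] * 8)
    excited = synthesize("hi", [0.5] * 8)
    print(len(calm), calm != excited)
```

The point of the sketch is that the same text yields a different waveform when the reference sample changes, which is the conditioning behaviour the summary describes. How the real model splits style across multiple levels is not covered here.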

Quick Start & Requirements

  • Install: Create the conda environment with conda env create -f environment.yaml, then activate it with conda activate generspeech.
  • Prerequisites: NVIDIA GPU with CUDA and cuDNN.
  • Pretrained Models: Available for the acoustic model, HiFi-GAN vocoder, and emotion encoder.
  • Demo: The README implies a demo page but does not link to it directly; the repository is at https://github.com/Rongjiehuang/GenerSpeech.
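Before creating the environment, it can help to confirm the prerequisites listed above are in place. The following is a minimal sketch using only the Python standard library; the preflight function is a hypothetical helper, not part of the repository, and finding nvidia-smi on PATH is only a rough proxy for a working NVIDIA driver (it does not verify CUDA or cuDNN).

```python
import shutil
from pathlib import Path


def preflight(repo_dir: str = ".") -> dict:
    """Hypothetical helper: report which setup prerequisites appear to be present."""
    repo = Path(repo_dir)
    return {
        # conda is needed for `conda env create -f environment.yaml`
        "conda_on_path": shutil.which("conda") is not None,
        # rough proxy for an NVIDIA driver; does not check CUDA or cuDNN
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
        # the environment file named in the install step
        "environment_yaml_present": (repo / "environment.yaml").is_file(),
    }


if __name__ == "__main__":
    for name, ok in preflight().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

Run it from the root of a cloned checkout so the environment.yaml check looks in the right directory.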

Highlighted Details

  • Zero-shot style transfer for out-of-domain custom voices.
  • Multi-level style transfer for expressive TTS.
  • Improved generalization to out-of-domain (OOD) style references.
  • PyTorch implementation of the NeurIPS'22 paper.

Maintenance & Community

  • Project released in December 2022.
  • Codebase incorporates elements from FastDiff and NATSpeech.
  • Citation details provided for academic use.

Licensing & Compatibility

  • License type not explicitly stated in the README.
  • The disclaimer prohibits generating speech without consent. Because no license is stated, commercial use or integration into closed-source applications may also be restricted.

Limitations & Caveats

The README includes a disclaimer against unauthorized speech generation, which may impose ethical or legal constraints on usage. Specific licensing details for commercial use are not provided.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 6 stars in the last 90 days
