HierSpeechpp by sh-lee-prml

PyTorch implementation for zero-shot TTS and voice conversion research

created 1 year ago
1,224 stars

Top 32.8% on sourcepulse

View on GitHub
1 Expert Loves This Project
Project Summary

HierSpeech++ is a PyTorch implementation for zero-shot speech synthesis and voice conversion, aiming to bridge the gap between semantic and acoustic representations. It offers a fast and robust alternative to LLM-based and diffusion-based models, targeting researchers and developers in speech technology.

How It Works

HierSpeech++ employs a hierarchical variational inference framework. For text-to-speech, a text-to-vector (TTV) model generates a self-supervised speech representation and an F0 (prosody) representation from text and a prosody prompt. The core Hierarchical Speech Synthesizer then generates 16 kHz speech from these representations, conditioned on a voice prompt. Finally, a high-efficiency speech super-resolution module upscales the audio from 16 kHz to 48 kHz. This hierarchical structure is claimed to significantly improve robustness and expressiveness.
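The sketch below illustrates only the data flow of these three stages; the function names and tensor shapes are hypothetical placeholders rather than the repository's actual modules, and random tensors stand in for real model outputs.

```python
import torch
import torchaudio

def text_to_vec(text: str) -> tuple[torch.Tensor, torch.Tensor]:
    """TTV stage (placeholder): text -> semantic representation + prosody representation."""
    semantic = torch.randn(1, 256, 100)   # [batch, channels, frames], illustrative shape
    prosody = torch.randn(1, 256)         # utterance-level prosody vector, illustrative
    return semantic, prosody

def hierarchical_synthesizer(semantic, prosody, voice_prompt) -> torch.Tensor:
    """Synthesizer stage (placeholder): representations + voice prompt -> 16 kHz waveform."""
    upsample_factor = 320                 # assumed frame-to-sample ratio, illustrative
    return torch.randn(1, semantic.shape[-1] * upsample_factor)

def speech_super_resolution(wav_16k: torch.Tensor) -> torch.Tensor:
    """Super-resolution stage: 16 kHz -> 48 kHz. The real module is learned; plain resampling stands in here."""
    return torchaudio.functional.resample(wav_16k, orig_freq=16000, new_freq=48000)

semantic, prosody = text_to_vec("Hello, zero-shot synthesis.")
voice_prompt = torch.randn(1, 16000 * 3)  # 3 s of 16 kHz reference audio, placeholder
wav_16k = hierarchical_synthesizer(semantic, prosody, voice_prompt)
wav_48k = speech_super_resolution(wav_16k)
print(wav_16k.shape, wav_48k.shape)       # torch.Size([1, 32000]) torch.Size([1, 96000])
```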

Quick Start & Requirements

  • Install: pip install -r requirements.txt and pip install phonemizer. Requires espeak-ng (sudo apt-get install espeak-ng).
  • Prerequisites: PyTorch >= 1.13, torchaudio >= 0.13.
  • Checkpoints: Pre-trained models for Hierarchical Speech Synthesizer and TTV are available for download.
  • Demo: A Gradio demo is available on HuggingFace.
  • Inference: Run inference.sh or inference_vc.sh with the specified checkpoints; a reference-audio preparation sketch follows this list.
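As a complement to the shell entry points above, the snippet below shows one way to prepare a reference (voice-prompt) clip before inference: the synthesizer operates on 16 kHz audio, so the clip is downmixed to mono and resampled. The file paths are placeholders; only standard torchaudio calls are used.

```python
import torchaudio

# Placeholder path to a reference clip of the target speaker.
waveform, sr = torchaudio.load("reference.wav")        # [channels, samples]
waveform = waveform.mean(dim=0, keepdim=True)          # downmix to mono

if sr != 16000:
    # The Hierarchical Speech Synthesizer consumes 16 kHz audio.
    waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=16000)

torchaudio.save("reference_16k.wav", waveform, 16000)
```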

Highlighted Details

  • Achieves human-level quality in zero-shot speech synthesis.
  • Outperforms LLM-based and diffusion-based models in experiments.
  • Supports both Text-to-Speech (TTS) and Voice Conversion (VC).
  • Includes a 16 kHz to 48 kHz speech super-resolution framework.

Maintenance & Community

The project extends previous works such as HierSpeech and HierVST. Updates have been released periodically, with announced work on TTV-v2 and code cleanup. Links to previous works and baseline models are provided.

Licensing & Compatibility

The repository is released under the MIT License, allowing for commercial use and integration with closed-source projects.

Limitations & Caveats

The project notes slow training speed and a relatively large model size compared to simpler models like VITS. It cannot currently generate realistic background sounds and may struggle with very long sentences due to training settings. The denoiser component may cause out-of-memory issues with long reference audio.
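If the denoiser runs out of memory on a long reference clip, one pragmatic workaround, assuming a shorter prompt is acceptable, is to trim the reference audio before inference. The snippet below keeps only the first few seconds of a clip; the duration cap and file names are illustrative.

```python
import torchaudio

max_seconds = 10                                       # illustrative cap on prompt length
waveform, sr = torchaudio.load("reference_16k.wav")    # [channels, samples]
waveform = waveform[:, : sr * max_seconds]             # keep only the first max_seconds
torchaudio.save("reference_trimmed.wav", waveform, sr)
```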

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 9 stars in the last 90 days

Explore Similar Projects

Starred by Tim J. Baek (Founder of Open WebUI), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 3 more.

StyleTTS2 by yl4579

0.2%
6k
Text-to-speech model achieving human-level synthesis
created 2 years ago
updated 11 months ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Pietro Schirano (Founder of MagicPath), and 1 more.

metavoice-src by metavoiceio

0%
4k
TTS model for human-like, expressive speech
created 1 year ago
updated 1 year ago