HierSpeechpp by sh-lee-prml

PyTorch implementation for zero-shot TTS and voice conversion research

created 1 year ago
1,224 stars

Top 32.8% on sourcepulse

View on GitHub
1 Expert Loves This Project
Project Summary

HierSpeech++ is a PyTorch implementation for zero-shot speech synthesis and voice conversion, aiming to bridge the gap between semantic and acoustic representations. It offers a fast and robust alternative to LLM-based and diffusion-based models, targeting researchers and developers in speech technology.

How It Works

HierSpeech++ employs a hierarchical variational inference framework. For text-to-speech, a text-to-vector (TTV) model generates a self-supervised speech representation and an F0 (prosody) representation from text and a prosody prompt. The core Hierarchical Speech Synthesizer then generates 16 kHz speech from these representations, conditioned on a voice prompt. Finally, a high-efficiency speech super-resolution module upscales the audio from 16 kHz to 48 kHz. This hierarchical structure is claimed to significantly improve robustness and expressiveness.
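The sketch below illustrates only the data flow of these three stages; the function names and tensor shapes are hypothetical placeholders rather than the repository's actual modules, and random tensors stand in for real model outputs.

```python
import torch
import torchaudio

def text_to_vec(text: str) -> tuple[torch.Tensor, torch.Tensor]:
    """TTV stage (placeholder): text -> semantic representation + prosody representation."""
    semantic = torch.randn(1, 256, 100)   # [batch, channels, frames], illustrative shape
    prosody = torch.randn(1, 256)         # utterance-level prosody vector, illustrative
    return semantic, prosody

def hierarchical_synthesizer(semantic, prosody, voice_prompt) -> torch.Tensor:
    """Synthesizer stage (placeholder): representations + voice prompt -> 16 kHz waveform."""
    upsample_factor = 320                 # assumed frame-to-sample ratio, illustrative
    return torch.randn(1, semantic.shape[-1] * upsample_factor)

def speech_super_resolution(wav_16k: torch.Tensor) -> torch.Tensor:
    """Super-resolution stage: 16 kHz -> 48 kHz. The real module is learned; plain resampling stands in here."""
    return torchaudio.functional.resample(wav_16k, orig_freq=16000, new_freq=48000)

semantic, prosody = text_to_vec("Hello, zero-shot synthesis.")
voice_prompt = torch.randn(1, 16000 * 3)  # 3 s of 16 kHz reference audio, placeholder
wav_16k = hierarchical_synthesizer(semantic, prosody, voice_prompt)
wav_48k = speech_super_resolution(wav_16k)
print(wav_16k.shape, wav_48k.shape)       # torch.Size([1, 32000]) torch.Size([1, 96000])
```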

Quick Start & Requirements

  • Install: pip install -r requirements.txt and pip install phonemizer. Requires espeak-ng (sudo apt-get install espeak-ng).
  • Prerequisites: PyTorch >= 1.13, torchaudio >= 0.13.
  • Checkpoints: Pre-trained models for Hierarchical Speech Synthesizer and TTV are available for download.
  • Demo: A Gradio demo is available on HuggingFace.
  • Inference: Run inference.sh or inference_vc.sh with the specified checkpoints; a reference-audio preparation sketch follows this list.
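As a complement to the shell entry points above, the snippet below shows one way to prepare a reference (voice-prompt) clip before inference: the synthesizer operates on 16 kHz audio, so the clip is downmixed to mono and resampled. The file paths are placeholders; only standard torchaudio calls are used.

```python
import torchaudio

# Placeholder path to a reference clip of the target speaker.
waveform, sr = torchaudio.load("reference.wav")        # [channels, samples]
waveform = waveform.mean(dim=0, keepdim=True)          # downmix to mono

if sr != 16000:
    # The Hierarchical Speech Synthesizer consumes 16 kHz audio.
    waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=16000)

torchaudio.save("reference_16k.wav", waveform, 16000)
```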

Highlighted Details

  • Achieves human-level quality in zero-shot speech synthesis.
  • Outperforms LLM-based and diffusion-based models in experiments.
  • Supports both Text-to-Speech (TTS) and Voice Conversion (VC).
  • Includes a 16 kHz to 48 kHz speech super-resolution framework.

Maintenance & Community

The project extends previous works such as HierSpeech and HierVST. Updates have been released periodically, with announced work on TTV-v2 and code cleanup. Links to previous works and baseline models are provided.

Licensing & Compatibility

The repository is released under the MIT License, allowing for commercial use and integration with closed-source projects.

Limitations & Caveats

The project notes slow training speed and a relatively large model size compared to simpler models like VITS. It cannot currently generate realistic background sounds and may struggle with very long sentences due to training settings. The denoiser component may cause out-of-memory issues with long reference audio.
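If the denoiser runs out of memory on a long reference clip, one pragmatic workaround, assuming a shorter prompt is acceptable, is to trim the reference audio before inference. The snippet below keeps only the first few seconds of a clip; the duration cap and file names are illustrative.

```python
import torchaudio

max_seconds = 10                                       # illustrative cap on prompt length
waveform, sr = torchaudio.load("reference_16k.wav")    # [channels, samples]
waveform = waveform[:, : sr * max_seconds]             # keep only the first max_seconds
torchaudio.save("reference_trimmed.wav", waveform, sr)
```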

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 9 stars in the last 90 days

Explore Similar Projects

Starred by Tim J. Baek (Founder of Open WebUI), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 3 more.

StyleTTS2 by yl4579

0.2%
6k
Text-to-speech model achieving human-level synthesis
created 2 years ago
updated 11 months ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Pietro Schirano (Founder of MagicPath), and 1 more.

metavoice-src by metavoiceio

0%
4k
TTS model for human-like, expressive speech
created 1 year ago
updated 1 year ago