PyTorch implementation for zero-shot speech synthesis
Top 9.1% on sourcepulse
MegaTTS 3 is a PyTorch implementation for high-quality, zero-shot voice cloning and bilingual text-to-speech (TTS). It targets researchers and developers needing efficient, controllable speech synthesis with minimal data for new voices. The system offers ultra-high-quality voice cloning and supports English and Chinese with code-switching.
How It Works
MegaTTS 3 is built on a 0.45B-parameter Diffusion Transformer backbone, enabling efficient inference. A WaveVAE model compresses speech into acoustic latents, which serve as training targets for the TTS model. These latents are more compact and converge faster than traditional mel-spectrograms, while still supporting high-fidelity speech reconstruction and voice cloning.
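To make the latent-compression idea concrete, here is a minimal numpy sketch: a waveform is folded into fixed-size frames and each frame is linearly projected to a low-dimensional latent, then projected back. The frame size (320), latent size (32), and random linear maps are illustrative assumptions, not MegaTTS 3's actual learned WaveVAE.

```python
import numpy as np

# Simplified sketch of the WaveVAE idea: fold a waveform into frames and
# project each frame to a compact acoustic latent, then invert.
# Frame size, latent size, and the linear maps are illustrative stand-ins
# for the learned encoder/decoder in the real model.

rng = np.random.default_rng(0)
hop, latent_dim = 320, 32

# "Encoder": one linear projection applied per audio frame.
W_enc = rng.standard_normal((hop, latent_dim)) / np.sqrt(hop)
# "Decoder": projects latents back to audio frames.
W_dec = rng.standard_normal((latent_dim, hop)) / np.sqrt(latent_dim)

wav = rng.standard_normal(3200)        # 3200 audio samples
frames = wav.reshape(-1, hop)          # (10, 320): 10 frames of raw audio
latents = frames @ W_enc               # (10, 32): compact targets for the TTS model
recon = (latents @ W_dec).reshape(-1)  # (3200,): reconstructed waveform

print(frames.shape, latents.shape, recon.shape)
```

Each 320-sample frame is represented by only 32 latent values, which is the kind of compression that makes the latents cheaper prediction targets than mel-spectrograms.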
Quick Start & Requirements
Install dependencies with pip install -r requirements.txt and set PYTHONPATH to the project root. Note the pinned pynini and WeTextProcessing versions. Docker support is available but under testing. Inference requires .npy voice latents generated from .wav samples.
Highlighted Details
Voice cloning is driven by precomputed .npy voice latents.
Maintenance & Community
The project is primarily intended for academic purposes. Contact information for questions and suggestions is provided via email.
Licensing & Compatibility
Licensed under the Apache-2.0 License. This license permits commercial use and linking with closed-source projects.
Limitations & Caveats
The Windows version is currently under testing. WaveVAE encoder parameters are not provided, requiring users to generate .npy latents. Docker support is also under testing.
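Since inference consumes precomputed voice latents as .npy files, a hedged sketch of handling such a file is shown below. The array shape, dtype, and filename are illustrative assumptions; real latents must come from the WaveVAE encoder, whose parameters the repo does not ship.

```python
import numpy as np

# Hypothetical voice-latent round trip. The (32, 100) shape and the file
# name "speaker_latent.npy" are made up for illustration; actual latents
# are produced by the project's WaveVAE encoder from .wav samples.
latent = np.random.randn(32, 100).astype(np.float32)
np.save("speaker_latent.npy", latent)

loaded = np.load("speaker_latent.npy")
print(loaded.shape, loaded.dtype)
```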