bytedance
PyTorch implementation for zero-shot speech synthesis
MegaTTS 3 is a PyTorch implementation of high-quality, zero-shot voice cloning and bilingual text-to-speech (TTS). It targets researchers and developers who need efficient, controllable speech synthesis from minimal data for new voices. It supports English and Chinese, including code-switching between the two.
How It Works
MegaTTS 3 utilizes a Diffusion Transformer backbone with 0.45B parameters, enabling efficient inference. It employs a WaveVAE model to compress speech into acoustic latents, which are then used as training targets for the TTS model. This approach allows for more compact representations and faster convergence compared to traditional mel-spectrograms, facilitating high-fidelity speech reconstruction and voice cloning.
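To put the 0.45B parameter count in context, here is a back-of-envelope sketch of how much memory the weights alone would occupy at common numeric precisions. The bytes-per-parameter values are generic assumptions, not figures from the project, and the estimate ignores activations, the WaveVAE model, and framework overhead.

```python
# Rough weight-memory estimate for a 0.45B-parameter model.
PARAM_COUNT = 450_000_000  # 0.45B, as stated in the description above

# Bytes per parameter for common numeric formats (generic assumption,
# not taken from the project).
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_gb(param_count: int, dtype: str) -> float:
    """Return weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return param_count * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: {weight_memory_gb(PARAM_COUNT, dtype):.2f} GB")
```

Under these assumptions the weights come to roughly 1.8 GB in float32 and 0.9 GB in half precision.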
Quick Start & Requirements
Install dependencies with pip install -r requirements.txt, then set PYTHONPATH to the project root.
[...] pynini and WeTextProcessing versions.
Docker support is available but under testing.
[...] .npy voice latents generated from .wav samples.
Highlighted Details
[...] .npy voice latents.
Maintenance & Community
The project is primarily intended for academic purposes. Questions and suggestions can be sent to the maintainers by email.
Licensing & Compatibility
Licensed under the Apache-2.0 License. This license permits commercial use and linking with closed-source projects.
Limitations & Caveats
The Windows version and Docker support are both still under testing. WaveVAE encoder parameters are not provided, so inference depends on .npy latents obtained separately for each .wav voice sample.
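Because each voice needs both a .wav sample and a matching .npy latent, it can help to check a voice folder before running inference. The sketch below is a hypothetical helper, not part of the project. It assumes each latent sits next to its sample under the same base name (for example speaker.wav and speaker.npy), which this summary does not confirm.

```python
from pathlib import Path
from typing import List, Tuple


def find_voice_pairs(voice_dir: str) -> Tuple[List[Tuple[Path, Path]], List[Path]]:
    """Split the .wav files in voice_dir into paired and unpaired samples.

    Assumption (hypothetical layout): the latent for "name.wav" is stored
    as "name.npy" in the same directory.
    """
    paired: List[Tuple[Path, Path]] = []
    missing: List[Path] = []
    for wav in sorted(Path(voice_dir).glob("*.wav")):
        latent = wav.with_suffix(".npy")  # same base name, .npy extension
        if latent.is_file():
            paired.append((wav, latent))
        else:
            missing.append(wav)
    return paired, missing
```

Calling find_voice_pairs on a folder returns the samples that are ready to use, along with the .wav files that still lack a latent.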