MegaTTS3 by bytedance

PyTorch implementation for zero-shot speech synthesis

created 4 months ago
5,686 stars

Top 9.1% on sourcepulse

Project Summary

MegaTTS 3 is a PyTorch implementation of high-quality, zero-shot voice cloning and bilingual text-to-speech (TTS). It targets researchers and developers who need efficient, controllable speech synthesis and want to add new voices from minimal data. It supports English and Chinese, including code-switching between the two.

How It Works

MegaTTS 3 utilizes a Diffusion Transformer backbone with 0.45B parameters, enabling efficient inference. It employs a WaveVAE model to compress speech into acoustic latents, which are then used as training targets for the TTS model. This approach allows for more compact representations and faster convergence compared to traditional mel-spectrograms, facilitating high-fidelity speech reconstruction and voice cloning.
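To put the 0.45B parameter count in perspective, the sketch below estimates how much memory the backbone's weights alone would occupy at common precisions. It is plain arithmetic, not a measurement of the project: the function name and the dtype table are illustrative, and the figure excludes activations, the WaveVAE decoder, the aligner, and the grapheme-to-phoneme model.

```python
# Back-of-the-envelope weight footprint for a 0.45B-parameter backbone.
# Assumption: only raw weights are counted; activations and the other
# models shipped with the project are ignored.

PARAMS = 0.45e9  # parameter count quoted in the summary above

# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Return the size of the weights in GiB for the given dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: {weight_memory_gib(PARAMS, dtype):.2f} GiB")
```

Under these assumptions the weights come to roughly 1.7 GiB in float32 and half that in a 16-bit format, which is small enough to fit on most consumer GPUs.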

Quick Start & Requirements

  • Installation: Clone the repository and install dependencies via pip install -r requirements.txt. Set PYTHONPATH to the project root.
  • Prerequisites: Python 3.10, PyTorch with CUDA 12.6 support for GPU inference. Windows users may need specific pynini and WeTextProcessing versions. Docker support is available but under testing.
  • Model Download: Pretrained checkpoints are available on Google Drive or Huggingface. WaveVAE encoder parameters are not included; users must provide .npy voice latents generated from .wav samples.
  • Demo: Huggingface Demo available.
  • Docs: See link1 and link2 for demos and voice latent generation.
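Because the WaveVAE encoder is not shipped, each .wav voice prompt needs a matching .npy latent before it can be used. The following is a minimal pre-flight sketch, not part of the project: it assumes the latent sits next to the .wav file under the same base name (an assumed convention), and the helper names find_voice_prompts and python_version_ok are hypothetical.

```python
# Pre-flight check before running inference (illustrative sketch only).
# Assumption: a prompt "speaker.wav" is paired with "speaker.npy" in the
# same directory; adjust if your latents are stored elsewhere.
import sys
from pathlib import Path


def find_voice_prompts(prompt_dir):
    """Split the .wav files in prompt_dir into (wav, npy) pairs and
    a list of .wav files that have no latent next to them."""
    paired, missing = [], []
    for wav in sorted(Path(prompt_dir).glob("*.wav")):
        npy = wav.with_suffix(".npy")
        if npy.is_file():
            paired.append((wav, npy))
        else:
            missing.append(wav)
    return paired, missing


def python_version_ok(version_info=sys.version_info):
    """True when the interpreter is Python 3.10, the listed prerequisite."""
    return tuple(version_info[:2]) == (3, 10)


if __name__ == "__main__":
    if not python_version_ok():
        print("Warning: the prerequisites list Python 3.10.")
    paired, missing = find_voice_prompts(".")
    print(f"{len(paired)} prompt(s) ready, {len(missing)} missing a .npy latent")
    for wav in missing:
        print(f"  no latent for {wav.name}")
```

Running it from the directory that holds your prompts lists every .wav that still needs a latent generated.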

Highlighted Details

  • Lightweight 0.45B parameter Diffusion Transformer backbone.
  • Ultra high-quality voice cloning via .npy voice latents.
  • Bilingual support for English and Chinese, including code-switching.
  • Controllable accent intensity; fine-grained duration adjustment is upcoming.
  • Includes a robust speech-text aligner and a Qwen2.5-based grapheme-to-phoneme model.

Maintenance & Community

The project is primarily intended for academic purposes. Contact information for questions and suggestions is provided via email.

Licensing & Compatibility

Licensed under the Apache-2.0 License. This license permits commercial use and linking with closed-source projects.

Limitations & Caveats

The Windows version is currently under testing. WaveVAE encoder parameters are not provided, requiring users to generate .npy latents. Docker support is also under testing.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 4
  • Star history: 713 stars in the last 90 days
