PyTorch implementation for zero-shot speech synthesis
Top 9.1% on sourcepulse
MegaTTS 3 is a PyTorch implementation for high-quality, zero-shot voice cloning and bilingual text-to-speech (TTS). It targets researchers and developers needing efficient, controllable speech synthesis with minimal data for new voices. The system offers ultra-high-quality voice cloning and supports English and Chinese with code-switching.
How It Works
MegaTTS 3 is built on a 0.45B-parameter Diffusion Transformer backbone, enabling efficient inference. A WaveVAE model compresses speech into acoustic latents, which serve as training targets for the TTS model. These latents are more compact and converge faster than traditional mel-spectrograms, while still supporting high-fidelity speech reconstruction and voice cloning.
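To make the latent-compression idea concrete, here is a minimal numpy sketch: a waveform is folded into fixed-size frames and each frame is linearly projected to a low-dimensional latent, then projected back. The frame size (320), latent size (32), and random linear maps are illustrative assumptions, not MegaTTS 3's actual learned WaveVAE.

```python
import numpy as np

# Simplified sketch of the WaveVAE idea: fold a waveform into frames and
# project each frame to a compact acoustic latent, then invert.
# Frame size, latent size, and the linear maps are illustrative stand-ins
# for the learned encoder/decoder in the real model.

rng = np.random.default_rng(0)
hop, latent_dim = 320, 32

# "Encoder": one linear projection applied per audio frame.
W_enc = rng.standard_normal((hop, latent_dim)) / np.sqrt(hop)
# "Decoder": projects latents back to audio frames.
W_dec = rng.standard_normal((latent_dim, hop)) / np.sqrt(latent_dim)

wav = rng.standard_normal(3200)        # 3200 audio samples
frames = wav.reshape(-1, hop)          # (10, 320): 10 frames of raw audio
latents = frames @ W_enc               # (10, 32): compact targets for the TTS model
recon = (latents @ W_dec).reshape(-1)  # (3200,): reconstructed waveform

print(frames.shape, latents.shape, recon.shape)
```

Each 320-sample frame is represented by only 32 latent values, which is the kind of compression that makes the latents cheaper prediction targets than mel-spectrograms.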
Quick Start & Requirements
Install dependencies with pip install -r requirements.txt and set PYTHONPATH to the project root. Note the pinned pynini and WeTextProcessing versions. Docker support is available but under testing. Inference requires .npy voice latents generated from .wav samples.
Highlighted Details
Voice cloning is driven by precomputed .npy voice latents.
Maintenance & Community
The project is primarily intended for academic purposes. Contact information for questions and suggestions is provided via email.
Licensing & Compatibility
Licensed under the Apache-2.0 License. This license permits commercial use and linking with closed-source projects.
Limitations & Caveats
The Windows version is currently under testing. WaveVAE encoder parameters are not provided, requiring users to generate .npy latents. Docker support is also under testing.
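Since inference consumes precomputed voice latents as .npy files, a hedged sketch of handling such a file is shown below. The array shape, dtype, and filename are illustrative assumptions; real latents must come from the WaveVAE encoder, whose parameters the repo does not ship.

```python
import numpy as np

# Hypothetical voice-latent round trip. The (32, 100) shape and the file
# name "speaker_latent.npy" are made up for illustration; actual latents
# are produced by the project's WaveVAE encoder from .wav samples.
latent = np.random.randn(32, 100).astype(np.float32)
np.save("speaker_latent.npy", latent)

loaded = np.load("speaker_latent.npy")
print(loaded.shape, loaded.dtype)
```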