FastSpeech2 by ming024

PyTorch implementation of FastSpeech 2 for text-to-speech

created 5 years ago
2,065 stars

Top 22.0% on sourcepulse

Project Summary

This PyTorch implementation of FastSpeech 2 provides an end-to-end text-to-speech system capable of generating high-quality speech with controllable prosody. It targets researchers and developers in speech synthesis, offering support for English and Mandarin, single and multi-speaker models, and integration with popular vocoders like MelGAN and HiFi-GAN.

How It Works

This implementation follows the FastSpeech 2 architecture and uses F0 values directly as pitch features, unlike later versions of the paper that derive pitch features with a continuous wavelet transform. Because the model is non-autoregressive, it generates all mel-spectrogram frames in parallel, giving much faster inference than autoregressive models such as Tacotron 2. Speaking rate, volume, and pitch can be controlled at synthesis time by scaling the predicted duration, energy, and pitch values with user-supplied ratios.
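
As a concrete illustration, the command below extends the repository's single-utterance inference call with control ratios. The flag names --duration_control, --pitch_control, and --energy_control are assumed to follow the upstream synthesize.py and should be verified against the checked-out version (e.g. via python3 synthesize.py --help):

    # Sketch: slow speech down by 20% and raise pitch by 10% (flag names assumed).
    python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step 900000 --mode single \
      -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml \
      --duration_control 1.2 --pitch_control 1.1 --energy_control 1.0

A ratio of 1.0 leaves a dimension unchanged; values above 1.0 lengthen durations (slower speech) or raise pitch/energy, and values below 1.0 do the opposite.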

Quick Start & Requirements

  • Install dependencies: pip3 install -r requirements.txt
  • Download pretrained models and place them in output/ckpt/ subdirectories.
  • Inference command example: python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step 900000 --mode single -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml
  • Requires Python 3.x. Preprocessing relies on Montreal Forced Aligner (MFA).
  • Training can be efficient, with acceptable quality achieved in under 10k steps on a GTX 1080Ti (preprocessing and training commands are sketched after this list).
  • Official audio samples are available.
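
The preprocessing and training workflow is sketched below. The script names (preprocess.py, train.py) and config triplet mirror the repository's usual layout but are assumptions here; confirm them against the upstream README:

    # Sketch: build mel-spectrogram, pitch, energy, and duration targets (requires MFA alignments).
    python3 preprocess.py config/LJSpeech/preprocess.yaml
    # Sketch: train from scratch with the same config triplet used at inference time.
    python3 train.py -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml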

Highlighted Details

  • Supports English (LJSpeech, LibriTTS) and Mandarin (AISHELL-3) datasets.
  • Enables control over pitch, volume, and speaking rate via specific parameters.
  • Includes batch inference for synthesizing many utterances at once (see the sketch after this list).
  • Integrates with MelGAN and HiFi-GAN vocoders.
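
A sketch of batch inference, assuming the --source and --mode batch options described upstream; the source file is a preprocessed metadata list, and the validation-split path below is illustrative:

    # Sketch: synthesize every utterance listed in a preprocessed metadata file (flags and paths assumed).
    python3 synthesize.py --source preprocessed_data/LJSpeech/val.txt --restore_step 900000 --mode batch \
      -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml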

Maintenance & Community

The project is based on xcmyz's FastSpeech implementation. Updates in 2021 added support for English and Mandarin, multi-speaker models, and vocoder integration. The repository is open for contributions and bug reports.

Licensing & Compatibility

The README does not explicitly state a license, but the phrasing "Feel free to use/modify the code" suggests permissive terms for research and potentially commercial use. Check the repository for a license file and verify its terms before redistributing or deploying commercially.

Limitations & Caveats

This implementation adds a Tacotron 2-style Post-Net, which is not part of the original FastSpeech 2 paper. It also predicts pitch and energy at the phoneme level rather than the frame level, which improves prosody but deviates from some later FastSpeech 2 variants. Alignment generation requires the Montreal Forced Aligner (MFA).

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 64 stars in the last 90 days

