FastSpeech2 by ming024

PyTorch implementation of FastSpeech 2 for text-to-speech

created 5 years ago
2,065 stars

Top 22.0% on sourcepulse

Project Summary

This PyTorch implementation of FastSpeech 2 provides an end-to-end text-to-speech system capable of generating high-quality speech with controllable prosody. It targets researchers and developers in speech synthesis, offering support for English and Mandarin, single and multi-speaker models, and integration with popular vocoders like MelGAN and HiFi-GAN.

How It Works

This implementation follows the FastSpeech 2 architecture and uses F0 values directly as pitch features, unlike later versions of the paper that derive pitch features with a continuous wavelet transform. Because the model is non-autoregressive, it generates all mel-spectrogram frames in parallel, giving much faster inference than autoregressive models such as Tacotron 2. Speaking rate, volume, and pitch can be controlled at synthesis time by scaling the predicted duration, energy, and pitch values with user-supplied ratios.
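
As a concrete illustration, the command below extends the repository's single-utterance inference call with control ratios. The flag names --duration_control, --pitch_control, and --energy_control are assumed to follow the upstream synthesize.py and should be verified against the checked-out version (e.g. via python3 synthesize.py --help):

    # Sketch: slow speech down by 20% and raise pitch by 10% (flag names assumed).
    python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step 900000 --mode single \
      -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml \
      --duration_control 1.2 --pitch_control 1.1 --energy_control 1.0

A ratio of 1.0 leaves a dimension unchanged; values above 1.0 lengthen durations (slower speech) or raise pitch/energy, and values below 1.0 do the opposite.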

Quick Start & Requirements

  • Install dependencies: pip3 install -r requirements.txt
  • Download pretrained models and place them in output/ckpt/ subdirectories.
  • Inference command example: python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step 900000 --mode single -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml
  • Requires Python 3.x. Preprocessing relies on Montreal Forced Aligner (MFA).
  • Training can be efficient, with acceptable quality achieved in under 10k steps on a GTX 1080Ti (preprocessing and training commands are sketched after this list).
  • Official audio samples are available.
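
The preprocessing and training workflow is sketched below. The script names (preprocess.py, train.py) and config triplet mirror the repository's usual layout but are assumptions here; confirm them against the upstream README:

    # Sketch: build mel-spectrogram, pitch, energy, and duration targets (requires MFA alignments).
    python3 preprocess.py config/LJSpeech/preprocess.yaml
    # Sketch: train from scratch with the same config triplet used at inference time.
    python3 train.py -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml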

Highlighted Details

  • Supports English (LJSpeech, LibriTTS) and Mandarin (AISHELL-3) datasets.
  • Enables control over pitch, volume, and speaking rate via specific parameters.
  • Includes batch inference for synthesizing many utterances at once (see the sketch after this list).
  • Integrates with MelGAN and HiFi-GAN vocoders.
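
A sketch of batch inference, assuming the --source and --mode batch options described upstream; the source file is a preprocessed metadata list, and the validation-split path below is illustrative:

    # Sketch: synthesize every utterance listed in a preprocessed metadata file (flags and paths assumed).
    python3 synthesize.py --source preprocessed_data/LJSpeech/val.txt --restore_step 900000 --mode batch \
      -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml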

Maintenance & Community

The project is based on xcmyz's FastSpeech implementation. Updates in 2021 added support for English and Mandarin, multi-speaker models, and vocoder integration. The repository is open for contributions and bug reports.

Licensing & Compatibility

The README does not explicitly state a license, but the phrasing "Feel free to use/modify the code" suggests permissive terms for research and potentially commercial use. Check the repository for a license file and verify its terms before redistributing or deploying commercially.

Limitations & Caveats

This implementation adds a Tacotron 2-style Post-Net, which is not part of the original FastSpeech 2 paper. It also predicts pitch and energy at the phoneme level rather than the frame level, which improves prosody but deviates from some later FastSpeech 2 variants. Alignment generation requires the Montreal Forced Aligner (MFA).

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 64 stars in the last 90 days

