PortaSpeech by keonlee9420

PyTorch for portable, high-quality generative TTS

Created 4 years ago

341 stars

Top 81.0% on SourcePulse

Project Summary

PortaSpeech is a PyTorch implementation of portable, high-quality generative text-to-speech (TTS). It targets researchers and developers who want efficient, controllable, high-fidelity speech synthesis, and it ships pre-trained models along with instructions for inference and training.

How It Works

PortaSpeech combines a linguistic encoder, a variational generator, and a flow-based post-net to synthesize speech. Speaking rate is controllable through duration ratios, in the manner of FastSpeech2. To avoid "mashed output", the VariationalGenerator omits ReLU activation and LayerNorm.
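
To make the duration-ratio idea concrete, here is a minimal sketch of FastSpeech2-style length regulation in plain Python. It is not the repository's code: the function name regulate_length, the integer phoneme ids, and the rounding rule are assumptions chosen for illustration. In this sketch a ratio above 1.0 stretches every phoneme (slower speech) and a ratio below 1.0 shortens it.

```python
def regulate_length(phoneme_ids, durations, duration_ratio=1.0):
    """Repeat each phoneme id for its scaled number of output frames.

    Hypothetical helper for illustration only; the real model works on
    hidden-state tensors and predicted durations, not on plain lists.
    """
    if len(phoneme_ids) != len(durations):
        raise ValueError("phoneme_ids and durations must have the same length")
    if duration_ratio <= 0:
        raise ValueError("duration_ratio must be positive")

    expanded = []
    for pid, frames in zip(phoneme_ids, durations):
        # Scale the per-phoneme frame count, then round to a whole frame.
        scaled = max(0, int(round(frames * duration_ratio)))
        expanded.extend([pid] * scaled)
    return expanded


# Three phonemes lasting 2, 3, and 1 frames.
normal = regulate_length([1, 2, 3], [2, 3, 1])
slower = regulate_length([1, 2, 3], [2, 3, 1], duration_ratio=2.0)
print(len(normal), len(slower))
```

Doubling the ratio doubles the number of output frames while the phoneme order stays unchanged, which is why a single scalar is enough to control speaking rate.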

Quick Start & Requirements

Install dependencies: pip3 install -r requirements.txt
Dockerfile is provided.
Download pre-trained models and place them in output/ckpt/DATASET/.
Inference: python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step RESTORE_STEP --mode single --dataset DATASET
Preprocessing requires Montreal Forced Aligner (MFA) for alignment. Pre-extracted alignments are available.
Training: python3 train.py --dataset DATASET (supports single-node multi-GPU training and Automatic Mixed Precision).
Documentation: demo, prepare_align.py, preprocess.py, train.py
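
The inference step above can be scripted. The sketch below only assembles the command line from the flags listed in this section and the checkpoint directory layout output/ckpt/DATASET/. The helper names, the dataset name "LJSpeech", and the step 125000 are placeholders, not values confirmed by the README.

```python
import shlex
from pathlib import Path


def checkpoint_dir(dataset, root="output/ckpt"):
    """Directory where the pre-trained checkpoint is expected to be placed."""
    return Path(root) / dataset


def build_synthesize_command(text, restore_step, dataset):
    """Assemble the single-sentence inference command as an argv list."""
    return [
        "python3", "synthesize.py",
        "--text", text,
        "--restore_step", str(restore_step),
        "--mode", "single",
        "--dataset", dataset,
    ]


# Placeholder values; substitute your own dataset name and checkpoint step.
cmd = build_synthesize_command("Hello world", 125000, "LJSpeech")
print(checkpoint_dir("LJSpeech").as_posix())
print(shlex.join(cmd))  # quoted form, safe to paste into a shell
```

Passing the list to subprocess.run(cmd, check=True) from the repository root would launch inference without any shell quoting issues.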

Highlighted Details

Offers "Normal" (24M parameters) and "Small" (7.6M parameters) model variants.
Supports controllable speaking rate and two helper losses (CTC, DGA) for improved word-to-phoneme alignment.
Compatible with HiFi-GAN and MelGAN vocoders.
TensorBoard integration for monitoring training progress.
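
As a rough illustration of what the two parameter counts mean for portability, the following back-of-the-envelope estimate converts them to weight storage. It assumes 32-bit floats (4 bytes per parameter) and counts weights only, ignoring the vocoder, optimizer state, and activations. The function name is hypothetical.

```python
def weight_size_mb(num_params, bytes_per_param=4):
    """Approximate weight storage in megabytes (10^6 bytes)."""
    return num_params * bytes_per_param / 1e6


normal_mb = weight_size_mb(24e6)   # "Normal" variant, 24M parameters
small_mb = weight_size_mb(7.6e6)   # "Small" variant, 7.6M parameters

print(f"Normal: {normal_mb:.1f} MB")   # Normal: 96.0 MB
print(f"Small:  {small_mb:.1f} MB")    # Small:  30.4 MB
print(f"Ratio:  {normal_mb / small_mb:.2f}x")
```

Under these assumptions the "Small" variant needs roughly a third of the weight storage of the "Normal" one.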

Maintenance & Community

The project is maintained by keonlee9420.
The README's references include VITS, Glow-TTS, and other TTS projects by keonlee9420.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README.

Limitations & Caveats

The README notes room for improvement in output quality, with a potential trade-off between audio quality and alignment accuracy.
Future extension to multi-speaker TTS is planned.

Health Check

Last Commit

3 years ago

Responsiveness

Inactive

Star History

0 stars in the last 30 days