PortaSpeech  by keonlee9420

PyTorch for portable, high-quality generative TTS

Created 4 years ago
341 stars

Top 81.0% on SourcePulse

Project Summary

This repository is a PyTorch implementation of PortaSpeech, a portable, high-quality generative text-to-speech (TTS) model. It targets researchers and developers who want efficient, controllable, high-fidelity speech synthesis, and it provides pre-trained models along with instructions for inference and training.

How It Works

PortaSpeech combines a linguistic encoder, a variational generator, and a flow-based post-net to synthesize speech. Speaking rate is controllable through duration ratios, in the style of FastSpeech2. To avoid "mashed output", the VariationalGenerator omits ReLU activation and LayerNorm.
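
To make the duration-ratio idea concrete, here is a minimal sketch of FastSpeech2-style length regulation in plain Python. It is an illustration under stated assumptions, not the repository's code: the function name `length_regulate`, the parameter `duration_ratio`, and the example phonemes and frame counts are all hypothetical. Each phoneme is repeated for its predicted number of frames, and the ratio scales those counts before expansion, so a ratio above 1.0 lengthens the utterance (slower speech) and a ratio below 1.0 shortens it.

```python
from typing import List, Sequence


def length_regulate(
    phonemes: Sequence[str],
    durations: Sequence[int],
    duration_ratio: float = 1.0,
) -> List[str]:
    """Repeat each phoneme for its scaled number of frames (hypothetical helper)."""
    if len(phonemes) != len(durations):
        raise ValueError("phonemes and durations must have the same length")
    if duration_ratio <= 0:
        raise ValueError("duration_ratio must be positive")

    frames: List[str] = []
    for phoneme, duration in zip(phonemes, durations):
        # Scale the predicted duration, round to whole frames, never go below zero.
        scaled = max(0, round(duration * duration_ratio))
        frames.extend([phoneme] * scaled)
    return frames


phonemes = ["HH", "AH", "L", "OW"]
durations = [2, 3, 1, 4]  # hypothetical per-phoneme frame counts

normal = length_regulate(phonemes, durations)       # ratio 1.0
slower = length_regulate(phonemes, durations, 2.0)  # every duration doubled
print(len(normal), len(slower))  # prints "10 20"
```

In a real model the durations come from a learned duration predictor and the expansion is applied to hidden-state tensors rather than phoneme labels; the scaling step is the part that exposes speaking-rate control.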

Quick Start & Requirements

  • Install dependencies: pip3 install -r requirements.txt
  • Dockerfile is provided.
  • Download pre-trained models and place them in output/ckpt/DATASET/.
  • Inference: python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step RESTORE_STEP --mode single --dataset DATASET
  • Preprocessing requires Montreal Forced Aligner (MFA) for alignment. Pre-extracted alignments are available.
  • Training: python3 train.py --dataset DATASET (supports single-node multi-GPU training and Automatic Mixed Precision).
  • Documentation: demo, prepare_align.py, preprocess.py, train.py
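
As a rough sketch of how the inference step above could be scripted, the snippet below assembles the documented `synthesize.py` command and the checkpoint directory from Python. The helper names `checkpoint_dir` and `build_synthesize_command` are hypothetical, and the values passed in (`"Hello world"`, `100000`, `"DATASET"`) are placeholders rather than values taken from the repository. Only the flags listed above are used.

```python
import shlex
from pathlib import Path
from typing import List


def checkpoint_dir(dataset: str) -> Path:
    """Directory where pre-trained checkpoints are expected: output/ckpt/DATASET/."""
    return Path("output") / "ckpt" / dataset


def build_synthesize_command(text: str, restore_step: int, dataset: str) -> List[str]:
    """Build the single-utterance inference command as an argument list."""
    return [
        "python3", "synthesize.py",
        "--text", text,
        "--restore_step", str(restore_step),
        "--mode", "single",
        "--dataset", dataset,
    ]


# Placeholder values; substitute your own text, checkpoint step, and dataset name.
cmd = build_synthesize_command("Hello world", 100000, "DATASET")
print(checkpoint_dir("DATASET").as_posix())
print(shlex.join(cmd))
```

Keeping the command as an argument list avoids shell-quoting problems with free-form text; it can be handed to `subprocess.run(cmd, check=True)` from the repository root once the checkpoint is in place.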

Highlighted Details

  • Offers "Normal" (24M parameters) and "Small" (7.6M parameters) model variants.
  • Supports controllable speaking rate and two helper losses (CTC, DGA) for improved word-to-phoneme alignment.
  • Compatible with HiFi-GAN and MelGAN vocoders.
  • TensorBoard integration for monitoring training progress.
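
For a sense of what the two parameter counts mean in practice, the back-of-the-envelope calculation below estimates raw weight storage. It assumes 32-bit floats (4 bytes per parameter) and ignores optimizer state, vocoder weights, and checkpoint file overhead, so real checkpoint files may be larger. The function name `fp32_weight_megabytes` is hypothetical.

```python
BYTES_PER_FP32 = 4  # assumption: weights stored as 32-bit floats


def fp32_weight_megabytes(num_parameters: float) -> float:
    """Approximate raw weight size in megabytes (1 MB = 1e6 bytes)."""
    return num_parameters * BYTES_PER_FP32 / 1e6


# Parameter counts as listed above.
variants = {"Normal": 24e6, "Small": 7.6e6}
sizes = {name: fp32_weight_megabytes(count) for name, count in variants.items()}

for name, megabytes in sizes.items():
    print(f"{name}: ~{megabytes:.1f} MB")
```

Under these assumptions the Normal variant needs roughly 96 MB for its weights and the Small variant roughly 30 MB, about a threefold reduction.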

Maintenance & Community

  • The project is maintained by keonlee9420.
  • References include VITS, Glow-TTS, and other TTS projects by the same author.

Licensing & Compatibility

  • The repository does not explicitly state a license in the provided README.

Limitations & Caveats

  • The README notes that output quality still has room for improvement, with a possible trade-off between audio quality and alignment accuracy.
  • Future extension to multi-speaker TTS is planned.

Health Check

  • Last commit: 3 years ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 1 star in the last 30 days
