pits by anonymous-pits

End-to-end TTS for pitch-controllable speech, no external pitch predictor

created 2 years ago
278 stars

Top 94.3% on sourcepulse

View on GitHub
Project Summary

PITS is an end-to-end text-to-speech (TTS) system designed for pitch-controllable speech synthesis without requiring an external pitch predictor. It targets researchers and developers in speech synthesis who need fine-grained control over vocal pitch in generated audio, offering high-quality, natural-sounding speech with controllable pitch variations.

How It Works

PITS builds upon the VITS architecture, incorporating a Yingram encoder and decoder. Its core innovation lies in using variational inference to model pitch, which is claimed to improve pitch variance compared to models that directly regress fundamental frequency. The system also employs adversarial training with pitch-shifted synthesis to enhance pitch controllability without degrading speech quality.
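The Yingram representation used by PITS is derived from the YIN pitch algorithm's cumulative mean normalized difference function. As a rough, stdlib-only illustration of that underlying function (the actual Yingram in PITS adds midi-scale binning and other details not reproduced here), a dip in the normalized difference at lag τ indicates a pitch period of τ samples:

```python
import math

def yin_difference(x, tau_max):
    """YIN difference function: d(tau) = sum_t (x[t] - x[t + tau])^2."""
    n = len(x) - tau_max
    return [sum((x[t] - x[t + tau]) ** 2 for t in range(n))
            for tau in range(tau_max)]

def cmnd(d):
    """Cumulative mean normalized difference d'(tau), with d'(0) = 1."""
    out = [1.0]
    running = 0.0
    for tau in range(1, len(d)):
        running += d[tau]
        out.append(d[tau] * tau / running if running else 1.0)
    return out

# A pure tone with a 20-sample period produces a sharp dip at tau = 20.
x = [math.sin(2 * math.pi * t / 20) for t in range(200)]
dp = cmnd(yin_difference(x, 40))
best_lag = min(range(1, 40), key=lambda t: dp[t])
```

Shifting which lag (or Yingram bin) the decoder is conditioned on is what enables pitch-shifted synthesis without an explicit F0 regressor.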

Quick Start & Requirements

  • Install: docker build -t=pits . (Dockerfile provided)
  • Prerequisites: PyTorch >= 1.7.0 (PyTorch 2.x is not supported because the code pins older dependency versions). VCTK dataset (version 0.92), resampled to 22050 Hz, 16-bit WAV format.
  • Training: Requires a multi-speaker dataset (single-speaker datasets are reported to cause training failures). Training is resource-intensive, estimated at over three weeks on 4 V100 GPUs.
  • Demo: Available on Hugging Face Space.
  • Docs: Training code and audio samples are available on GitHub.
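Since training assumes VCTK audio at 22050 Hz, 16-bit PCM WAV, files can be sanity-checked before preprocessing with a small stdlib-only helper (a hypothetical convenience function, not part of the repository):

```python
import wave

def check_wav(path, rate=22050, sampwidth=2):
    """Return True if `path` is a WAV file at the expected sample rate
    and bit depth (sampwidth of 2 bytes == 16-bit PCM)."""
    with wave.open(path, "rb") as wf:
        return wf.getframerate() == rate and wf.getsampwidth() == sampwidth
```

Files that fail the check would need resampling (e.g. with ffmpeg or sox) before being fed to the training pipeline.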

Highlighted Details

  • Reported to achieve speech quality close to ground-truth recordings.
  • Demonstrates high pitch-controllability without quality degradation.
  • Leverages Yingram encoder/decoder and adversarial training for pitch-shifted synthesis.
  • Accepted to ICML 2023 Workshop on Structured Probabilistic Inference & Generative Modeling.

Maintenance & Community

  • Code and audio samples are available on GitHub.
  • Demo and checkpoints are hosted on Hugging Face Space.
  • References official VITS, NANSY, Avocodo, and PhaseAug implementations.

Licensing & Compatibility

  • License details are not explicitly stated in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

Training on single-speaker datasets is reported to cause failures due to the GAN-based pitch-shift training. The system is sensitive to sampling rates, with 22050 Hz being the recommended and tested rate. Modifying the phoneme set requires manual edits to the Python source, and training on languages other than English requires referencing other VITS language variants.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 3 stars in the last 90 days

Explore Similar Projects

Starred by Tim J. Baek (Founder of Open WebUI), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 3 more.

StyleTTS2 by yl4579

Text-to-speech model achieving human-level synthesis

  • 0.2% · 6k stars
  • created 2 years ago · updated 11 months ago