pits by anonymous-pits

End-to-end TTS for pitch-controllable speech, no external pitch predictor

created 2 years ago
278 stars

Top 94.3% on sourcepulse

View on GitHub
Project Summary

PITS is an end-to-end text-to-speech (TTS) system designed for pitch-controllable speech synthesis without requiring an external pitch predictor. It targets researchers and developers in speech synthesis who need fine-grained control over vocal pitch in generated audio, offering high-quality, natural-sounding speech with controllable pitch variations.

How It Works

PITS builds upon the VITS architecture, incorporating a Yingram encoder and decoder. Its core innovation lies in using variational inference to model pitch, which is claimed to improve pitch variance compared to models that directly regress fundamental frequency. The system also employs adversarial training with pitch-shifted synthesis to enhance pitch controllability without degrading speech quality.
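The Yingram representation used by PITS is derived from the YIN pitch algorithm's cumulative mean normalized difference function. As a rough, stdlib-only illustration of that underlying function (the actual Yingram in PITS adds midi-scale binning and other details not reproduced here), a dip in the normalized difference at lag τ indicates a pitch period of τ samples:

```python
import math

def yin_difference(x, tau_max):
    """YIN difference function: d(tau) = sum_t (x[t] - x[t + tau])^2."""
    n = len(x) - tau_max
    return [sum((x[t] - x[t + tau]) ** 2 for t in range(n))
            for tau in range(tau_max)]

def cmnd(d):
    """Cumulative mean normalized difference d'(tau), with d'(0) = 1."""
    out = [1.0]
    running = 0.0
    for tau in range(1, len(d)):
        running += d[tau]
        out.append(d[tau] * tau / running if running else 1.0)
    return out

# A pure tone with a 20-sample period produces a sharp dip at tau = 20.
x = [math.sin(2 * math.pi * t / 20) for t in range(200)]
dp = cmnd(yin_difference(x, 40))
best_lag = min(range(1, 40), key=lambda t: dp[t])
```

Shifting which lag (or Yingram bin) the decoder is conditioned on is what enables pitch-shifted synthesis without an explicit F0 regressor.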

Quick Start & Requirements

  • Install: docker build -t=pits . (Dockerfile provided)
  • Prerequisites: PyTorch >= 1.7.0 (PyTorch 2.x is not supported because the code pins older dependency versions). VCTK dataset (version 0.92), resampled to 22050 Hz, 16-bit WAV format.
  • Training: Requires a multi-speaker dataset (single-speaker datasets are reported to cause training failures). Training is resource-intensive, estimated at over three weeks on 4 V100 GPUs.
  • Demo: Available on Hugging Face Space.
  • Docs: Training code and audio samples are available on GitHub.
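Since training assumes VCTK audio at 22050 Hz, 16-bit PCM WAV, files can be sanity-checked before preprocessing with a small stdlib-only helper (a hypothetical convenience function, not part of the repository):

```python
import wave

def check_wav(path, rate=22050, sampwidth=2):
    """Return True if `path` is a WAV file at the expected sample rate
    and bit depth (sampwidth of 2 bytes == 16-bit PCM)."""
    with wave.open(path, "rb") as wf:
        return wf.getframerate() == rate and wf.getsampwidth() == sampwidth
```

Files that fail the check would need resampling (e.g. with ffmpeg or sox) before being fed to the training pipeline.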

Highlighted Details

  • Reported to achieve speech quality close to ground-truth recordings.
  • Demonstrates high pitch-controllability without quality degradation.
  • Leverages Yingram encoder/decoder and adversarial training for pitch-shifted synthesis.
  • Accepted to ICML 2023 Workshop on Structured Probabilistic Inference & Generative Modeling.

Maintenance & Community

  • Code and audio samples are available on GitHub.
  • Demo and checkpoints are hosted on Hugging Face Space.
  • References official VITS, NANSY, Avocodo, and PhaseAug implementations.

Licensing & Compatibility

  • License details are not explicitly stated in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

Training on single-speaker datasets is reported to cause failures due to the GAN-based pitch-shift training. The system is sensitive to sampling rates, with 22050 Hz being the recommended and tested rate. Modifying the phoneme set requires manual edits to the Python source, and training on languages other than English requires referencing other VITS language variants.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 3 stars in the last 90 days

Explore Similar Projects

Starred by Tim J. Baek (Founder of Open WebUI), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 3 more.

StyleTTS2 by yl4579

Text-to-speech model achieving human-level synthesis

  • 0.2% · 6k stars
  • created 2 years ago · updated 11 months ago