End-to-end TTS for pitch-controllable speech, no external pitch predictor
PITS is an end-to-end text-to-speech (TTS) system that synthesizes pitch-controllable speech without an external pitch predictor. It targets speech-synthesis researchers and developers who need fine-grained control over vocal pitch while keeping the generated audio high-quality and natural-sounding.
How It Works
PITS builds upon the VITS architecture, incorporating a Yingram encoder and decoder. Its core innovation lies in using variational inference to model pitch, which is claimed to improve pitch variance compared to models that directly regress fundamental frequency. The system also employs adversarial training with pitch-shifted synthesis to enhance pitch controllability without degrading speech quality.
Quick Start & Requirements
docker build -t=pits .
(Dockerfile provided)
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
Training on single-speaker datasets is reported to fail because of the GAN-based pitch-shift training. The system is sensitive to sampling rate; 22050 Hz is the recommended and tested rate. Changing the phoneme set requires manual edits to the Python files, and training for languages other than English requires consulting other VITS language variants.
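Given the sampling-rate sensitivity, it may help to check a dataset before training. The following is a minimal sketch using only the Python standard library. It is not part of PITS, and the function name and directory layout are assumptions. The `wave` module reads uncompressed PCM WAV files only.

```python
import wave
from pathlib import Path

EXPECTED_RATE = 22050  # the recommended and tested rate

def find_wrong_rate(wav_dir, expected_rate=EXPECTED_RATE):
    """Return (path, rate) for every .wav file whose sample rate differs."""
    mismatches = []
    for path in sorted(Path(wav_dir).rglob("*.wav")):
        with wave.open(str(path), "rb") as wav:
            rate = wav.getframerate()
        if rate != expected_rate:
            mismatches.append((path, rate))
    return mismatches
```

Calling `find_wrong_rate("path/to/wavs")` with a hypothetical dataset directory returns an empty list when every file is already at 22050 Hz; any files it lists would need resampling first.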
2 years ago
Inactive