pheme by PolyAI-LDN

TTS framework for efficient, conversational speech generation (research paper)

created 1 year ago
260 stars

Top 98.2% on sourcepulse

Project Summary

Pheme is an open-source framework for training efficient and conversational Text-to-Speech (TTS) models, designed for researchers and developers seeking high-quality speech synthesis with reduced data and computational requirements. It enables training Transformer-based models using significantly less data than comparable systems like VALL-E or SoundStorm, while supporting diverse data sources including conversational, podcast, and noisy audio.

How It Works

Pheme separates semantic and acoustic tokens, leveraging a specialized speech tokenizer for efficient representation. This architecture facilitates MaskGit-style parallel inference, achieving up to 15x speed-ups over autoregressive models. The framework emphasizes parameter, data, and inference efficiency, allowing for compact models and low-latency generation.
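The MaskGit-style parallel decoding mentioned above can be sketched in a few lines. This is a toy illustration only: the random stand-in predictor, the cosine unmasking schedule, and the step count are assumptions for demonstration, not Pheme's actual model or settings.

```python
import math
import random

MASK = -1  # sentinel for a not-yet-generated token position

def toy_predictor(tokens):
    """Stand-in for the acoustic model: returns a (token, confidence)
    pair per position. In a real system this would be a single parallel
    Transformer forward pass over the whole sequence (assumption)."""
    return [(random.randrange(0, 1024), random.random()) for _ in tokens]

def maskgit_decode(length, steps=8):
    """MaskGit-style parallel decoding: start fully masked, predict every
    masked position in parallel each step, and commit only the most
    confident predictions, following a cosine unmasking schedule. A few
    parallel steps replace `length` sequential autoregressive steps."""
    tokens = [MASK] * length
    for step in range(steps):
        # Fraction of positions that should remain masked after this step;
        # reaches 0 on the final step so every position gets committed.
        ratio = math.cos(math.pi / 2 * (step + 1) / steps)
        n_keep_masked = int(length * ratio)
        preds = toy_predictor(tokens)
        # Rank masked positions by confidence; commit the most confident.
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[: len(masked) - n_keep_masked]:
            tokens[i] = preds[i][0]
    return tokens
```

Because each step fills many positions at once, the number of model calls is a small constant (here 8) rather than one per token, which is where the claimed speed-up over autoregressive decoding comes from.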

Quick Start & Requirements

  • Environment Setup: Create a conda environment (conda create --name pheme3 python=3.10, conda activate pheme3), install PyTorch (pip3 install torch torchvision torchaudio), and then install dependencies (pip3 install -r requirements.txt --no-deps).
  • Pre-trained Models: Download SpeechTokenizer, unique token lists, and T2S/S2A models from Hugging Face. Requires a Hugging Face Hub token for speaker embeddings.
  • Data Preparation: Audio files must be resampled to 16kHz and formatted into JSON manifests with text, raw text, duration, and phoneme information.
  • Inference: Invoked via python transformer_infer.py.
  • Training: Separate scripts train_t2s.py and train_s2a.py are provided for training the Text-to-Semantic (T2S) and Semantic-to-Acoustic (S2A) components, respectively.
  • Dependencies: Python 3.10, PyTorch, and GNU parallel for audio preprocessing. GPU acceleration is implied for training and inference.
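The data-preparation step above can be illustrated by building one JSON manifest record. The field names below are illustrative assumptions, not Pheme's exact manifest schema; consult the repository's data-prep scripts for the real keys.

```python
import json

def make_manifest_entry(audio_path, text, raw_text, duration_s, phonemes):
    """Build one JSON manifest record of the kind the data-prep step
    describes: 16 kHz audio plus text, raw text, duration, and phoneme
    metadata. Key names are assumed for illustration."""
    return {
        "audio_filepath": audio_path,  # path to 16 kHz resampled audio (assumed key)
        "text": text,                  # normalized transcript
        "raw_text": raw_text,          # original, unnormalized transcript
        "duration": duration_s,        # utterance length in seconds
        "phoneme": phonemes,           # phoneme sequence for the transcript
    }

entry = make_manifest_entry("clips/utt_0001.wav", "hello world",
                            "Hello, world!", 1.42, "HH AH0 L OW1 W ER1 L D")
line = json.dumps(entry)  # one record per line in a JSON-lines manifest
```

Audio resampling to 16 kHz itself would typically be done with torchaudio or ffmpeg before writing the manifest.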

Highlighted Details

  • Achieves high-quality TTS with up to 10x less training data compared to VALL-E or SoundStorm.
  • Supports training on conversational, podcast, and noisy datasets like GigaSpeech.
  • Offers 15x faster inference through MaskGit-style parallel generation.
  • Demonstrates a low real-time factor (RTF of 0.133 on an A100 GPU) for both the 100M and 300M parameter variants.

Maintenance & Community

The project is associated with PolyAI and the authors of the "Pheme: Efficient and Conversational Speech Generation" paper. Links to demos and audio samples are provided.

Licensing & Compatibility

The repository does not explicitly state a license in the README. This requires further investigation for commercial use or integration into closed-source projects.

Limitations & Caveats

The absence of a stated license is a critical blocker for adoption. While pre-trained models are available, training from scratch requires careful data preparation and environment configuration.

Health Check
Last commit

1 year ago

Responsiveness

1 week

Pull Requests (30d)
0
Issues (30d)
0
Star History
5 stars in the last 90 days

Explore Similar Projects

Starred by Tim J. Baek (Founder of Open WebUI), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 3 more.

StyleTTS2 by yl4579

0.2%
6k
Text-to-speech model achieving human-level synthesis
created 2 years ago
updated 11 months ago