pheme by PolyAI-LDN

TTS framework for efficient, conversational speech generation (research paper)

created 1 year ago
260 stars

Top 98.2% on sourcepulse

Project Summary

Pheme is an open-source framework for training efficient and conversational Text-to-Speech (TTS) models, designed for researchers and developers seeking high-quality speech synthesis with reduced data and computational requirements. It enables training Transformer-based models using significantly less data than comparable systems like VALL-E or SoundStorm, while supporting diverse data sources including conversational, podcast, and noisy audio.

How It Works

Pheme separates semantic and acoustic tokens, leveraging a specialized speech tokenizer for efficient representation. This architecture facilitates MaskGit-style parallel inference, achieving up to 15x speed-ups over autoregressive models. The framework emphasizes parameter, data, and inference efficiency, allowing for compact models and low-latency generation.
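The MaskGit-style parallel decoding mentioned above can be sketched in a few lines. This is a toy illustration only: the random stand-in predictor, the cosine unmasking schedule, and the step count are assumptions for demonstration, not Pheme's actual model or settings.

```python
import math
import random

MASK = -1  # sentinel for a not-yet-generated token position

def toy_predictor(tokens):
    """Stand-in for the acoustic model: returns a (token, confidence)
    pair per position. In a real system this would be a single parallel
    Transformer forward pass over the whole sequence (assumption)."""
    return [(random.randrange(0, 1024), random.random()) for _ in tokens]

def maskgit_decode(length, steps=8):
    """MaskGit-style parallel decoding: start fully masked, predict every
    masked position in parallel each step, and commit only the most
    confident predictions, following a cosine unmasking schedule. A few
    parallel steps replace `length` sequential autoregressive steps."""
    tokens = [MASK] * length
    for step in range(steps):
        # Fraction of positions that should remain masked after this step;
        # reaches 0 on the final step so every position gets committed.
        ratio = math.cos(math.pi / 2 * (step + 1) / steps)
        n_keep_masked = int(length * ratio)
        preds = toy_predictor(tokens)
        # Rank masked positions by confidence; commit the most confident.
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[: len(masked) - n_keep_masked]:
            tokens[i] = preds[i][0]
    return tokens
```

Because each step fills many positions at once, the number of model calls is a small constant (here 8) rather than one per token, which is where the claimed speed-up over autoregressive decoding comes from.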

Quick Start & Requirements

  • Environment Setup: Create a conda environment (conda create --name pheme3 python=3.10, conda activate pheme3), install PyTorch (pip3 install torch torchvision torchaudio), and then install dependencies (pip3 install -r requirements.txt --no-deps).
  • Pre-trained Models: Download SpeechTokenizer, unique token lists, and T2S/S2A models from Hugging Face. Requires a Hugging Face Hub token for speaker embeddings.
  • Data Preparation: Audio files must be resampled to 16kHz and formatted into JSON manifests with text, raw text, duration, and phoneme information.
  • Inference: Invoked via python transformer_infer.py.
  • Training: Separate scripts train_t2s.py and train_s2a.py are provided for training the Text-to-Semantic (T2S) and Semantic-to-Acoustic (S2A) components, respectively.
  • Dependencies: Python 3.10, PyTorch, and GNU parallel for audio preprocessing. GPU acceleration is implied for training and inference.
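The data-preparation step above can be illustrated by building one JSON manifest record. The field names below are illustrative assumptions, not Pheme's exact manifest schema; consult the repository's data-prep scripts for the real keys.

```python
import json

def make_manifest_entry(audio_path, text, raw_text, duration_s, phonemes):
    """Build one JSON manifest record of the kind the data-prep step
    describes: 16 kHz audio plus text, raw text, duration, and phoneme
    metadata. Key names are assumed for illustration."""
    return {
        "audio_filepath": audio_path,  # path to 16 kHz resampled audio (assumed key)
        "text": text,                  # normalized transcript
        "raw_text": raw_text,          # original, unnormalized transcript
        "duration": duration_s,        # utterance length in seconds
        "phoneme": phonemes,           # phoneme sequence for the transcript
    }

entry = make_manifest_entry("clips/utt_0001.wav", "hello world",
                            "Hello, world!", 1.42, "HH AH0 L OW1 W ER1 L D")
line = json.dumps(entry)  # one record per line in a JSON-lines manifest
```

Audio resampling to 16 kHz itself would typically be done with torchaudio or ffmpeg before writing the manifest.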

Highlighted Details

  • Achieves high-quality TTS with up to 10x less training data compared to VALL-E or SoundStorm.
  • Supports training on conversational, podcast, and noisy datasets like GigaSpeech.
  • Offers 15x faster inference through MaskGit-style parallel generation.
  • Demonstrates a low real-time factor (RTF of 0.133 on an A100 GPU) for both the 100M and 300M parameter variants.

Maintenance & Community

The project is associated with PolyAI and the authors of the "Pheme: Efficient and Conversational Speech Generation" paper. Links to demos and audio samples are provided.

Licensing & Compatibility

The repository does not explicitly state a license in the README. This requires further investigation for commercial use or integration into closed-source projects.

Limitations & Caveats

The absence of a stated license is a critical blocker for adoption. While pre-trained models are available, training from scratch requires careful data preparation and environment configuration.

Health Check
Last commit

1 year ago

Responsiveness

1 week

Pull Requests (30d)
0
Issues (30d)
0
Star History
5 stars in the last 90 days

Explore Similar Projects

Starred by Tim J. Baek (Founder of Open WebUI), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 3 more.

StyleTTS2 by yl4579

0.2%
6k
Text-to-speech model achieving human-level synthesis
created 2 years ago
updated 11 months ago