PyTorch implementation of MetaAI's Voicebox text-to-speech model
This repository provides a PyTorch implementation of MetaAI's Voicebox, a state-of-the-art text-to-speech (TTS) model. It aims to offer a flexible and efficient framework for researchers and developers working on advanced speech synthesis, particularly those interested in multilingual and universal speech generation.
How It Works
The implementation leverages a conditional flow matching approach, integrating components like HubertWithKmeans for semantic tokenization and EncodecVoco for audio encoding/decoding. It supports both text-conditioned and unconditional generation, utilizing adaptive normalization for time conditioning and offering flexibility in ODE solver choices (torchdiffeq, torchode).
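To make the flow matching idea concrete, here is a minimal sketch in plain Python of the optimal-transport conditional path commonly used with conditional flow matching, followed by a fixed-step Euler integrator for sampling. It is not the repository's API: `cfm_pair`, `euler_sample`, `oracle_velocity`, and the `SIGMA_MIN` value are hypothetical names and choices made for illustration, the "network" is replaced by an exact closed-form velocity, and the Euler loop stands in for the adaptive solvers that torchdiffeq or torchode would provide.

```python
SIGMA_MIN = 1e-5  # assumed small constant for illustration; not taken from the repository


def cfm_pair(x0, x1, t, sigma_min=SIGMA_MIN):
    """Return (x_t, target_velocity) on the optimal-transport conditional path.

    x0 is a noise sample, x1 is a data sample, t is a time in [0, 1].
    A model would be trained to regress target_velocity given (x_t, t).
    """
    scale = 1.0 - (1.0 - sigma_min) * t
    x_t = [scale * a + t * b for a, b in zip(x0, x1)]
    target = [b - (1.0 - sigma_min) * a for a, b in zip(x0, x1)]
    return x_t, target


def euler_sample(velocity_fn, x0, steps=32):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def oracle_velocity(x1, sigma_min=SIGMA_MIN):
    """Stand-in for a trained network: the exact conditional velocity toward x1."""
    def fn(x, t):
        denom = 1.0 - (1.0 - sigma_min) * t
        return [(b - (1.0 - sigma_min) * xi) / denom for xi, b in zip(x, x1)]
    return fn
```

Calling `euler_sample(oracle_velocity(x1), x0)` carries a noise vector `x0` to a point very close to `x1`. In a real model the oracle would be replaced by a learned network conditioned on time and, for text-conditioned generation, on the aligned text and audio context.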
Quick Start & Requirements
Install: pip install voicebox-pytorch
Requires HubertWithKmeans (e.g., from fairseq) and potentially a trained TextToSemantic model (e.g., Spear-TTS).
Highlighted Details
Supports torchdiffeq and torchode for ODE solving.
Maintenance & Community
The project has received sponsorship from StabilityAI and an Imminent Grant. Notable contributors include Bryan Chiang and Lucas Newman. Community links are not explicitly provided in the README.
Licensing & Compatibility
The README does not explicitly state a license. Users should check the repository's license terms before commercial use or integration into closed-source projects.
Limitations & Caveats
The author recommends using E2 TTS instead of this implementation. Some aspects, like correctly handling MelVoco encode settings and specifying sampling duration in seconds, are still marked as "to-do." The project appears to be under active development with some features still pending.