PyTorch implementation of Meta's Voicebox speech model
This repository provides a PyTorch implementation of Voicebox, Meta's large-scale, text-guided generative model for speech. Voicebox is designed to generalize across speech tasks, including text-to-speech synthesis, noise removal, and content editing, and is described as delivering state-of-the-art performance with faster inference than existing models. The repository is aimed at researchers and developers working on advanced speech synthesis and manipulation.
How It Works
Voicebox is a non-autoregressive flow-matching model trained on over 50,000 hours of diverse speech data. It learns to "infill" speech: given a text transcript and the surrounding audio, both past and future, it generates the missing segment. This design enables in-context learning similar to that of large language models, so a single model can handle zero-shot TTS, noise removal, style transfer, and content editing.
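To make the infilling idea more concrete, here is a minimal, dependency-free Python sketch of how one flow-matching training example with an infill mask could be built. It is not taken from this repository: the function names (make_infill_mask, cfm_training_pair, masked_mse), the SIGMA_MIN value, and the use of one float per frame in place of a mel-spectrogram frame are all assumptions made for illustration. A real implementation would use PyTorch tensors and a neural network that predicts the velocity.

```python
import random

SIGMA_MIN = 1e-5  # assumed small constant for the probability path


def make_infill_mask(num_frames, start, end):
    """Return 1 for frames the model must generate, 0 for visible context."""
    return [1 if start <= i < end else 0 for i in range(num_frames)]


def cfm_training_pair(x1, mask, t, rng):
    """Build one flow-matching training example (hypothetical helper).

    x1   : target features, one float per frame
    mask : 1 where speech is infilled, 0 where audio context is given
    t    : flow time in [0, 1]
    rng  : a random.Random instance used to draw the noise sample
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # Gaussian noise sample
    scale = 1.0 - (1.0 - SIGMA_MIN) * t
    # Point on the straight path from noise (t=0) to data (t=1).
    x_t = [scale * n + t * x for n, x in zip(x0, x1)]
    # Velocity of that path, which the model is trained to predict.
    target = [x - (1.0 - SIGMA_MIN) * n for n, x in zip(x0, x1)]
    # Audio context: the true features with the infill region zeroed out.
    context = [x * (1 - m) for x, m in zip(x1, mask)]
    return x_t, target, context


def masked_mse(pred, target, mask):
    """Mean squared error computed over the masked (infilled) frames only."""
    count = max(1, sum(mask))
    return sum(m * (p - y) ** 2 for p, y, m in zip(pred, target, mask)) / count


# Usage: hide frames 2 and 3 of a six-frame toy sequence.
features = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5]
mask = make_infill_mask(len(features), 2, 4)
x_t, target, context = cfm_training_pair(features, mask, 0.3, random.Random(0))
```

In this sketch the model would receive x_t, the context, the flow time, and the text, and would be trained so that its output matches target on the masked frames. Because the model sees context on both sides of the gap, the same setup covers editing a segment in the middle of an utterance as well as continuing from a prompt.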
Quick Start & Requirements
pip install -r requirements.txt
Highlighted Details
Maintenance & Community
The project is associated with Meta AI researchers. Further community engagement channels (e.g., Discord, Slack) or a roadmap are not explicitly mentioned in the README.
Licensing & Compatibility
The README does not specify a license. Compatibility for commercial use or closed-source linking is therefore undetermined.
Limitations & Caveats
The provided README indicates that several key features, including training scripts, cross-lingual style transfer, and specific editing capabilities, are still marked as "Todo" items, suggesting the implementation may be incomplete or under active development.
2 years ago
Inactive