Meta-voicebox by SpeechifyInc

PyTorch implementation of Meta's Voicebox speech model

created 2 years ago
583 stars

Top 56.4% on sourcepulse

Project Summary

This repository provides a PyTorch implementation of Voicebox, Meta's large-scale, text-guided generative model for speech. Voicebox is designed to generalize across speech tasks, including text-to-speech synthesis, noise removal, and content editing, and is reported to deliver state-of-the-art performance with faster inference than earlier models. The target audience is researchers and developers working on advanced speech synthesis and manipulation.

How It Works

Voicebox utilizes a non-autoregressive flow-matching model trained on over 50,000 hours of diverse speech data. This approach allows it to perform "infilling" of speech, conditioning on both past and future audio context along with text. This design enables in-context learning similar to large language models, providing flexibility for tasks like zero-shot TTS, noise removal, style transfer, and content editing.
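
The infilling objective described above can be sketched in a few lines. This is a minimal illustration, not the repository's code: it assumes the optimal-transport conditional flow-matching path, uses plain Python lists as stand-ins for spectrogram frames, and every name in it (SIGMA_MIN, make_infill_mask, flow_matching_pair, context_for_model, masked_mse) is hypothetical. A real implementation would operate on PyTorch tensors and feed the noisy sample, the masked context, and the text into a neural network.

```python
import random

# Assumed small constant for the flow path; not taken from the repository.
SIGMA_MIN = 1e-5


def make_infill_mask(num_frames, start, end):
    """Return 1 for frames the model must generate, 0 for context frames."""
    return [1 if start <= i < end else 0 for i in range(num_frames)]


def context_for_model(x1, mask):
    """Zero out the frames to be generated; keep past and future context."""
    return [0.0 if m else value for value, m in zip(x1, mask)]


def flow_matching_pair(x1, t, rng):
    """Build the noisy sample x_t and the target velocity for flow time t.

    x1 is the clean data, x0 is Gaussian noise, and t lies in [0, 1].
    At t = 0 the sample is pure noise; at t = 1 it is (almost) the data.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]
    scale = 1.0 - (1.0 - SIGMA_MIN) * t
    x_t = [scale * noise + t * data for noise, data in zip(x0, x1)]
    target = [data - (1.0 - SIGMA_MIN) * noise for noise, data in zip(x0, x1)]
    return x_t, target


def masked_mse(pred, target, mask):
    """Mean squared error computed over the masked (generated) frames only."""
    total = sum(m * (p - q) ** 2 for p, q, m in zip(pred, target, mask))
    return total / max(1, sum(mask))


if __name__ == "__main__":
    rng = random.Random(0)
    frames = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    mask = make_infill_mask(len(frames), start=2, end=4)
    context = context_for_model(frames, mask)
    x_t, target = flow_matching_pair(frames, t=rng.random(), rng=rng)
    # A trained network would predict the velocity from (x_t, context, text, t);
    # here a zero prediction stands in for it.
    print(masked_mse([0.0] * len(frames), target, mask))
```

Because the context keeps frames on both sides of the masked span, the same setup covers editing a segment in the middle of an utterance as well as continuing from a prompt.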

Quick Start & Requirements

  • Install: pip install -r requirements.txt
  • Prerequisites: PyTorch and Python 3.8+, with CUDA recommended for GPU acceleration. The README does not detail hardware requirements, though large-scale training and inference will likely need a capable GPU.
  • Resources: Training on 50K+ hours of speech suggests significant computational resources and storage.
  • Links: Voicebox Paper
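
Before installing, it can help to confirm that the prerequisites listed above are in place. The following is a small sketch of such a check; the function name check_environment is hypothetical and the script is not part of the repository. It relies only on the standard library plus torch.cuda.is_available(), which is called only if PyTorch is already installed.

```python
import importlib.util
import sys


def check_environment(min_python=(3, 8)):
    """Report whether the interpreter and PyTorch meet the listed prerequisites."""
    report = {
        "python_ok": sys.version_info[:2] >= min_python,
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "cuda_available": False,
    }
    if report["torch_installed"]:
        try:
            import torch

            report["cuda_available"] = bool(torch.cuda.is_available())
        except (ImportError, OSError):
            # A broken install is treated the same as a missing one.
            report["torch_installed"] = False
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {ok}")
```

If cuda_available is False, the model may still run on CPU, but the README gives no guidance on how practical that is.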

Highlighted Details

  • Outperforms VALL-E in intelligibility (1.9% WER for Voicebox vs 5.9% for VALL-E; lower is better) and audio similarity (0.681 vs 0.580; higher is better).
  • Up to 20x faster inference than VALL-E, as reported in the Voicebox paper.
  • Supports monolingual and cross-lingual zero-shot text-to-speech synthesis.
  • Capable of noise removal, content editing, and style conversion.
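
For readers unfamiliar with the intelligibility metric quoted above, word error rate (WER) is the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below shows only the metric itself; the function name word_error_rate is hypothetical, and it assumes the hypothesis text has already been obtained, typically by running a speech recognizer over the generated audio.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

A WER of 1.9% therefore means roughly two word errors per hundred reference words.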

Maintenance & Community

The repository is published under the SpeechifyInc account; the Voicebox model it implements originates from Meta AI researchers. The README does not mention community channels (e.g., Discord, Slack) or a roadmap.

Licensing & Compatibility

The README does not specify a license. Compatibility for commercial use or closed-source linking is therefore undetermined.

Limitations & Caveats

The provided README indicates that several key features, including training scripts, cross-lingual style transfer, and specific editing capabilities, are still marked as "Todo" items, suggesting the implementation may be incomplete or under active development.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 3 stars in the last 90 days
