Meta-voicebox by SpeechifyInc

PyTorch implementation of Meta's Voicebox speech model

Created 2 years ago

588 stars

Top 55.2% on SourcePulse

Project Summary

This repository provides a PyTorch implementation of Voicebox, Meta's large-scale, text-guided generative model for speech. Voicebox is designed to generalize across speech tasks, including text-to-speech synthesis, noise removal, and content editing, and is reported to be both more accurate and faster at inference than earlier models such as VALL-E. The target audience is researchers and developers working on advanced speech synthesis and manipulation.

How It Works

Voicebox is a non-autoregressive flow-matching model trained on over 50,000 hours of diverse speech data. It learns to "infill" speech: given a text transcript and the audio surrounding a masked segment, it generates the missing audio, conditioning on both past and future context. As with large language models, this design enables in-context learning, so one model can handle zero-shot TTS, noise removal, style transfer, and content editing.
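The infilling objective described above can be illustrated with a small, dependency-free sketch. It is not taken from this repository: it assumes the commonly used optimal-transport flow-matching path, and the names `make_infill_mask`, `flow_matching_example`, `masked_loss`, and the constant `SIGMA_MIN` are hypothetical. Plain floats stand in for acoustic feature frames, and no neural network is involved; the code only shows how a training example and a masked loss might be assembled.

```python
import random

SIGMA_MIN = 1e-5  # assumed small constant; not taken from the repository


def make_infill_mask(num_frames, start, end):
    """Return 1 for frames the model must generate, 0 for context frames."""
    return [1 if start <= i < end else 0 for i in range(num_frames)]


def flow_matching_example(x1, mask, t, rng):
    """Build one training example for masked conditional flow matching.

    x1   : target frames (floats standing in for acoustic features)
    mask : 1 where audio is hidden and must be infilled, 0 where it is context
    t    : flow time in [0, 1]
    rng  : random.Random instance used to draw the noise sample
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # noise sample
    scale = 1.0 - (1.0 - SIGMA_MIN) * t
    # Point on the straight path from noise (t = 0) to data (t = 1).
    x_t = [scale * a + t * b for a, b in zip(x0, x1)]
    # Velocity the model would be trained to predict at x_t.
    target = [b - (1.0 - SIGMA_MIN) * a for a, b in zip(x0, x1)]
    # Visible past and future audio; masked frames are zeroed out.
    context = [(1 - m) * b for m, b in zip(mask, x1)]
    return x_t, context, target


def masked_loss(predicted, target, mask):
    """Mean squared error over masked (to-be-generated) frames only."""
    errors = [(p - u) ** 2 for p, u, m in zip(predicted, target, mask) if m]
    return sum(errors) / len(errors) if errors else 0.0


rng = random.Random(0)
frames = [0.5, -0.2, 0.9, 0.1, -0.7, 0.3]
mask = make_infill_mask(len(frames), start=2, end=4)
x_t, context, target = flow_matching_example(frames, mask, t=0.5, rng=rng)
print("mask:   ", mask)
print("context:", context)
```

In a real model, a network would receive `x_t`, `context`, the text, and `t`, and its output would be compared against `target` with `masked_loss`. Restricting the loss to masked frames is what makes the task infilling rather than plain reconstruction.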

Quick Start & Requirements

Install: pip install -r requirements.txt
Prerequisites: PyTorch, Python 3.8+, and potentially CUDA for GPU acceleration. The README does not detail hardware requirements, although a model of this scale implies substantial GPU resources for training and inference.
Resources: Training on 50K+ hours of speech suggests significant computational resources and storage.
Links: Voicebox Paper
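Before installing, it can help to confirm that the local environment meets the prerequisites listed above. The following is a minimal sketch using only the standard library plus an optional PyTorch import. The function name `check_environment` and the report keys are hypothetical and are not part of the repository.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 8)  # minimum version stated in the prerequisites


def check_environment():
    """Report whether Python, PyTorch, and CUDA look usable on this machine."""
    report = {
        "python_ok": sys.version_info >= MIN_PYTHON,
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "cuda_available": None,  # stays None when PyTorch is not installed
    }
    if report["torch_installed"]:
        import torch  # imported lazily so the check works without PyTorch

        report["cuda_available"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    for name, value in check_environment().items():
        print(f"{name}: {value}")
```

A `cuda_available` value of `False` does not prevent the code from running, but CPU-only inference for a model of this size is likely to be slow.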

Highlighted Details

Outperforms VALL-E in intelligibility (1.9% WER for Voicebox vs 5.9% for VALL-E; lower is better) and audio similarity (0.681 vs 0.580; higher is better).
Up to 20x faster inference than state-of-the-art TTS models.
Supports monolingual and cross-lingual zero-shot text-to-speech synthesis.
Capable of noise removal, content editing, and style conversion.
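The intelligibility figures above are word error rates (WER): the word-level edit distance between a reference transcript and a recognized transcript, divided by the number of reference words. The sketch below shows that calculation in plain Python. The function name `word_error_rate` is hypothetical, and the evaluation pipeline that produced the reported numbers (for example, which speech recognizer transcribed the synthesized audio) is not described here.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of five reference words.
wer = word_error_rate("the quick brown fox jumps", "the quick brown box jumps")
print(f"WER: {wer:.1%}")  # prints "WER: 20.0%"
```

Because WER counts errors, lower values indicate more intelligible speech.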

Maintenance & Community

The repository is published by SpeechifyInc and implements a model originally developed by Meta AI researchers. The README does not mention community channels (e.g., Discord, Slack) or a roadmap.

Licensing & Compatibility

The README does not specify a license. Compatibility for commercial use or closed-source linking is therefore undetermined.

Limitations & Caveats

The README still lists several key features as "Todo" items, including training scripts, cross-lingual style transfer, and specific editing capabilities, so the implementation may be incomplete or under active development.
