voicebox-pytorch by lucidrains

PyTorch implementation of MetaAI's Voicebox text-to-speech model

created 2 years ago
658 stars

Top 51.8% on sourcepulse

Project Summary

This repository provides a PyTorch implementation of MetaAI's Voicebox, a state-of-the-art text-to-speech (TTS) model. It aims to offer a flexible and efficient framework for researchers and developers working on advanced speech synthesis, particularly those interested in multilingual and universal speech generation.

How It Works

The implementation leverages a conditional flow matching approach, integrating components like HubertWithKmeans for semantic tokenization and EncodecVoco for audio encoding/decoding. It supports both text-conditioned and unconditional generation, utilizing adaptive normalization for time conditioning and offering flexibility in ODE solver choices (torchdiffeq, torchode).
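
To make the conditional flow matching idea concrete, here is a minimal sketch of how one training pair can be built along the optimal-transport path from noise to data. It is plain Python rather than the repository's code: the function names `cfm_training_pair` and `mse` and the value of `SIGMA_MIN` are assumptions chosen for illustration, and a real model would add conditioning (semantic tokens, masked audio context) that is omitted here.

```python
import random

SIGMA_MIN = 1e-5  # assumed small constant; the value here is illustrative


def cfm_training_pair(x1, t, sigma_min=SIGMA_MIN, rng=random):
    """Build one flow matching training pair for data sample x1 at time t in [0, 1].

    Returns (x_t, target_velocity, x0), where x0 is the sampled Gaussian noise.
    """
    # Noise sample with the same shape as the data sample.
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]

    # Point on the straight-line path: x_t = (1 - (1 - sigma_min) * t) * x0 + t * x1
    scale = 1.0 - (1.0 - sigma_min) * t
    x_t = [scale * n + t * d for n, d in zip(x0, x1)]

    # Time derivative of that path, which the network is trained to predict.
    target = [d - (1.0 - sigma_min) * n for n, d in zip(x0, x1)]
    return x_t, target, x0


def mse(pred, target):
    """Mean squared error between a predicted and a target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)
```

In training, a network would receive `x_t`, `t`, and the conditioning, and its output would be compared against `target` with a loss such as `mse`.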

Quick Start & Requirements

  • Install: pip install voicebox-pytorch
  • Prerequisites: Requires pre-trained checkpoints for HubertWithKmeans (e.g., from fairseq) and potentially a trained TextToSemantic model (e.g., Spear-TTS).
  • Usage examples for training and sampling are provided in the README.

Highlighted Details

  • Implements Voicebox, a SOTA TTS network from MetaAI.
  • Utilizes rotary embeddings and adaptive normalization.
  • Integrates with Spear-TTS for text-to-semantic conditioning.
  • Supports both torchdiffeq and torchode for ODE solving.
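
As a rough illustration of what the ODE solving step does at sampling time, the sketch below integrates a velocity field from t = 0 (noise) to t = 1 (data) with a hand-written fixed-step Euler method. This is a swapped-in stand-in, not the torchdiffeq or torchode API, and `euler_sample` and `make_point_target_field` are hypothetical names. The velocity field used here is the closed-form flow toward a single known target (with the minimum noise scale set to zero), standing in for a trained network.

```python
def euler_sample(velocity, x0, steps=10):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # the last evaluation is at t = (steps - 1) / steps, never at t = 1
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_point_target_field(x1):
    """Stand-in for a trained model: the flow that carries any start point to x1."""
    def velocity(x, t):
        # Closed-form conditional velocity for a single target x1: (x1 - x) / (1 - t)
        return [(d - xi) / (1.0 - t) for d, xi in zip(x1, x)]
    return velocity


# Usage: start from an arbitrary point and follow the field to the target.
field = make_point_target_field([0.5, -1.0, 2.0])
sample = euler_sample(field, [3.0, 3.0, 3.0], steps=10)
```

A library solver would replace `euler_sample` in practice, typically offering adaptive step sizes or higher-order methods in exchange for more function evaluations per step.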

Maintenance & Community

The project has received sponsorship from StabilityAI and an Imminent Grant. Notable contributors include Bryan Chiang and Lucas Newman. Community links are not explicitly provided in the README.

Licensing & Compatibility

The README does not explicitly state a license. Check the repository's license file before using the code commercially or integrating it into closed-source projects.

Limitations & Caveats

The author recommends using E2 TTS instead of this implementation. Some items, such as correctly handling MelVoco encode settings and specifying sampling duration in seconds, are still marked as "to-do." Given that recommendation and a last commit 10 months ago, these pending features may not be completed.

Health Check

  • Last commit: 10 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0

Star History

  • 8 stars in the last 90 days
