voicebox-pytorch  by lucidrains

PyTorch implementation of MetaAI's Voicebox text-to-speech model

Created 2 years ago
661 stars

Top 50.7% on SourcePulse

Project Summary

This repository provides a PyTorch implementation of MetaAI's Voicebox, a state-of-the-art text-to-speech (TTS) model. It aims to offer a flexible and efficient framework for researchers and developers working on advanced speech synthesis, particularly those interested in multilingual and universal speech generation.

How It Works

The implementation leverages a conditional flow matching approach, integrating components like HubertWithKmeans for semantic tokenization and EncodecVoco for audio encoding/decoding. It supports both text-conditioned and unconditional generation, utilizing adaptive normalization for time conditioning and offering flexibility in ODE solver choices (torchdiffeq, torchode).
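A minimal sketch of the conditional flow matching idea described above, written in plain Python rather than PyTorch. It assumes the optimal-transport path from the flow matching literature, x_t = (1 - (1 - sigma_min) t) x0 + t x1, with target velocity x1 - (1 - sigma_min) x0. The function names, the sigma_min value, and the fixed-step Euler loop are illustrative assumptions, not the repository's code. In the actual project a neural network predicts the velocity, and torchdiffeq or torchode would take the place of the Euler loop.

```python
import random

SIGMA_MIN = 1e-5  # assumed small constant, chosen only for this sketch


def cfm_pair(x0, x1, t, sigma_min=SIGMA_MIN):
    """Return (x_t, target_velocity) on the straight-line path from noise x0 to data x1."""
    xt = [(1 - (1 - sigma_min) * t) * a + t * b for a, b in zip(x0, x1)]
    target = [b - (1 - sigma_min) * a for a, b in zip(x0, x1)]
    return xt, target


def cfm_loss(predicted, target):
    """Mean squared error between a predicted velocity and the target velocity."""
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)


def euler_sample(velocity_fn, x0, steps=8):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


rng = random.Random(0)
x1 = [0.5, -1.0, 2.0]                    # stand-in for a data sample
x0 = [rng.gauss(0.0, 1.0) for _ in x1]   # Gaussian noise sample
xt, target = cfm_pair(x0, x1, t=0.3)

# Stand-in for a perfectly trained model: it returns the true conditional velocity.
sampled = euler_sample(lambda x, t: target, x0)
```

Because the conditional velocity is constant along this path, the Euler loop lands on sigma_min * x0 + x1, which is the data sample up to a tiny noise residual. A learned velocity field is not constant, which is where the choice of ODE solver and step count starts to matter.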

Quick Start & Requirements

  • Install: pip install voicebox-pytorch
  • Prerequisites: Requires pre-trained checkpoints for HubertWithKmeans (e.g., from fairseq) and potentially a trained TextToSemantic model (e.g., Spear-TTS).
  • Usage examples for training and sampling are provided in the README.
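After installing, one way to confirm that the package is visible to the current interpreter is to query its distribution metadata. This sketch uses only the standard library; the helper name installed_version is hypothetical.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("voicebox-pytorch")
if version is None:
    print("voicebox-pytorch is not installed; run: pip install voicebox-pytorch")
else:
    print(f"voicebox-pytorch {version} is installed")
```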

Highlighted Details

  • Implements Voicebox, a SOTA TTS network from MetaAI.
  • Utilizes rotary embeddings and adaptive normalization.
  • Integrates with Spear-TTS for text-to-semantic conditioning.
  • Supports both torchdiffeq and torchode for ODE solving.
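The two mechanisms named in the list above, rotary embeddings and adaptive normalization, can be sketched in plain Python. This is a generic illustration, not the repository's code: the interleaved pairing of dimensions, the base of 10000, and the RMS-style normalization are assumptions, and in a real model gamma and beta would be produced by a learned projection of the time embedding rather than passed in by hand.

```python
import math


def rotary(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles that grow linearly with position `pos`."""
    d = len(vec)
    assert d % 2 == 0, "rotary embeddings need an even dimension"
    out = []
    for i in range(0, d, 2):
        angle = pos * base ** (-i / d)  # one frequency per pair of dimensions
        c, s = math.cos(angle), math.sin(angle)
        a, b = vec[i], vec[i + 1]
        out += [a * c - b * s, a * s + b * c]
    return out


def adaptive_rms_norm(x, gamma, beta, eps=1e-8):
    """Normalize `x` to unit RMS, then apply a condition-dependent scale and shift."""
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    return [g * v / (rms + eps) + b for v, g, b in zip(x, gamma, beta)]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.1]

# Both position pairs have the same offset (2), so the attention scores should match.
score_a = dot(rotary(q, 3), rotary(k, 1))
score_b = dot(rotary(q, 5), rotary(k, 3))

# With gamma = 1 and beta = 0 the output is simply the unit-RMS version of the input.
normed = adaptive_rms_norm([3.0, 4.0], gamma=[1.0, 1.0], beta=[0.0, 0.0])
```

The point of the rotary check is that the query-key score depends only on the relative offset between positions, which is why the scheme suits sequences of varying length.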

Maintenance & Community

The project has received sponsorship from StabilityAI and an Imminent Grant. Notable contributors include Bryan Chiang and Lucas Newman. Community links are not explicitly provided in the README.

Licensing & Compatibility

The README does not explicitly state a license. Verify the license terms in the repository before commercial use or integration into closed-source projects.

Limitations & Caveats

The author recommends using E2 TTS instead of this implementation. Some items, such as correctly handling MelVoco encode settings and specifying sampling duration in seconds, are still marked as "to-do" in the README, so some features remain pending.

Health Check

  • Last commit: 11 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 2 stars in the last 30 days
