vall-e  by lifeiteng

PyTorch re-implementation of VALL-E for zero-shot text-to-speech synthesis

created 2 years ago
2,162 stars

Top 21.3% on sourcepulse

Project Summary

This repository provides an unofficial PyTorch implementation of VALL-E, a zero-shot text-to-speech (TTS) system. It lets users train models and synthesize speech that preserves the identity of a speaker heard only in a short prompt. It targets researchers and developers working on speech synthesis.

How It Works

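VALL-E is a neural codec language model: audio is represented as discrete tokens produced by a neural audio codec (EnCodec in the original paper), not by a variational autoencoder. Generation is conditioned on the phonemized text and a short audio prompt from the target speaker, and it runs in two stages. An autoregressive (AR) model predicts the tokens of the first codebook one step at a time, and a non-autoregressive (NAR) model then predicts the tokens of the remaining codebooks, one codebook per pass. The implementation offers several prefix modes that control how the acoustic prompt is supplied to the NAR decoder.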
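The two-stage decoding described in the VALL-E paper (an AR pass over the first codebook, then one NAR pass per remaining codebook) can be sketched as below. This is a toy illustration of the control flow only, not the repository's code: `toy_ar_step`, `toy_nar_pass`, `synthesize_codes`, and the `EOS` id are hypothetical names, and the "models" are random stand-ins. The codebook count and size follow the 8 x 1024 EnCodec configuration used in the paper.

```python
import random

NUM_CODEBOOKS = 8      # residual quantizer layers (EnCodec setup in the VALL-E paper)
CODEBOOK_SIZE = 1024   # entries per codebook
EOS = CODEBOOK_SIZE    # hypothetical end-of-sequence id for the AR stage


def toy_ar_step(phonemes, prompt_codes, generated, rng):
    """Stand-in for the AR decoder: return the next first-codebook token."""
    # A real model would attend over phonemes, the prompt, and earlier tokens.
    # This toy version simply stops after two frames per phoneme.
    if len(generated) >= 2 * len(phonemes):
        return EOS
    return rng.randrange(CODEBOOK_SIZE)


def toy_nar_pass(phonemes, prompt_codes, lower_layers, layer, rng):
    """Stand-in for the NAR decoder: predict one whole codebook layer at once."""
    length = len(lower_layers[0])
    return [rng.randrange(CODEBOOK_SIZE) for _ in range(length)]


def synthesize_codes(phonemes, prompt_codes, seed=0, max_len=1000):
    """Return NUM_CODEBOOKS token sequences of equal length."""
    rng = random.Random(seed)

    # Stage 1 (AR): first-codebook tokens are generated one at a time.
    first = []
    while len(first) < max_len:
        token = toy_ar_step(phonemes, prompt_codes, first, rng)
        if token == EOS:
            break
        first.append(token)

    # Stage 2 (NAR): each remaining layer is predicted in a single pass,
    # conditioned on all layers produced so far.
    layers = [first]
    for layer in range(1, NUM_CODEBOOKS):
        layers.append(toy_nar_pass(phonemes, prompt_codes, layers, layer, rng))
    return layers


if __name__ == "__main__":
    prompt = [[0] * 10 for _ in range(NUM_CODEBOOKS)]  # dummy acoustic prompt
    codes = synthesize_codes(["HH", "AH", "L", "OW"], prompt)
    print(len(codes), "layers of", len(codes[0]), "tokens each")
```

In a real system the resulting token layers would be passed to the codec's decoder to reconstruct a waveform; that step is omitted here.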

Quick Start & Requirements

  • Install: Follow the detailed installation steps in the README, which involve installing specific versions of PyTorch, torchaudio, and other dependencies like librosa, phonemizer, pypinyin, lhotse, and k2. This includes cloning the icefall repository and setting up the Python path.
  • Prerequisites: PyTorch 1.13.1 with CUDA 11.6, Python 3.10, espeak-ng (Linux/macOS), and potentially a 24GB GPU for training.
  • Setup Time: Not stated. Installation spans several manual steps (pinned dependency versions, cloning icefall, setting the Python path), so allow more time than a single-command install would need.
  • Links: Official Demo, Reproduced Demo, Icefall, Lhotse.
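Before starting the install, it can help to confirm that the prerequisites listed above are present. The following is a minimal sketch, not a script from the repository: `check_environment` is a hypothetical helper, and it only inspects the Python version, the presence of an `espeak-ng` executable on `PATH`, and (if PyTorch is already installed) its version and CUDA availability.

```python
import shutil
import sys


def check_environment(expected_python=(3, 10)):
    """Report whether the listed prerequisites appear to be available."""
    report = {
        # Compare only major.minor against the version named in the prerequisites.
        "python_ok": sys.version_info[:2] == expected_python,
        # espeak-ng is the phonemizer backend; look for it on PATH.
        "espeak_ng_found": shutil.which("espeak-ng") is not None,
    }
    try:
        import torch  # optional: only inspected if already installed
        report["torch_version"] = torch.__version__
        report["cuda_available"] = torch.cuda.is_available()
    except ImportError:
        report["torch_version"] = None
        report["cuda_available"] = False
    return report


if __name__ == "__main__":
    for name, value in check_environment().items():
        print(f"{name}: {value}")
```

The check does not verify the CUDA toolkit version or the k2, lhotse, and icefall installs; those still need to be confirmed against the README.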

Highlighted Details

  • Supports training on a single GPU with 24GB memory.
  • Includes examples for both English (LibriTTS) and Chinese (AIShell1) datasets.
  • Offers different prefix modes for NAR decoder conditioning, including a mode matching the original paper.
  • Provides detailed training scripts for both AR and NAR models.

Maintenance & Community

  • The project is maintained by Feiteng Li.
  • Links to community resources are not explicitly provided in the README.

Licensing & Compatibility

  • The repository does not explicitly state a license.
  • The project is an unofficial implementation. The original VALL-E paper notes potential risks of misuse; to avoid abuse, no well-trained models or services are provided.

Limitations & Caveats

As an unofficial implementation, the project may not reproduce the original VALL-E's performance. Installation is complex, requiring specific dependency versions and manual path configuration. The README notes the risk of misuse for voice impersonation, and no pre-trained models are offered.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 44 stars in the last 90 days
