An unofficial PyTorch re-implementation of VALL-E for zero-shot text-to-speech synthesis
Top 21.3% on sourcepulse
This repository provides an unofficial PyTorch implementation of VALL-E, a zero-shot text-to-speech (TTS) system. It lets researchers and developers in the speech synthesis domain train models and synthesize speech that preserves a target speaker's identity from only a short audio prompt.
How It Works
VALL-E is a neural codec language model: audio is represented as discrete tokens produced by a neural audio codec (EnCodec), and a language model predicts those tokens conditioned on phonemized text and a short acoustic prompt. An autoregressive (AR) stage generates the tokens of the first codebook, and a non-autoregressive (NAR) stage fills in the remaining codebook layers in parallel. The implementation exposes a prefix-mode option controlling how the acoustic prompt conditions generation, allowing flexible speaker adaptation and style transfer.
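The sketch below illustrates this two-stage decoding flow with toy stand-in modules; the class names, tensor shapes, and codebook sizes are illustrative assumptions, not this repository's actual API.

```python
# Minimal sketch of VALL-E's two-stage decoding, using toy stand-ins for the
# real AR/NAR transformers (names and shapes here are illustrative only).
import torch
import torch.nn as nn

VOCAB = 1024      # size of each codec codebook (assumption)
N_CODEBOOKS = 8   # residual quantizer depth (assumption)
DIM = 256

class ToyARModel(nn.Module):
    """Predicts the next first-codebook token from text + prompt + history."""
    def __init__(self):
        super().__init__()
        self.text_emb = nn.Embedding(100, DIM)
        self.code_emb = nn.Embedding(VOCAB, DIM)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, text_ids, code_ids):
        ctx = self.text_emb(text_ids).mean(dim=1) + self.code_emb(code_ids).mean(dim=1)
        return self.head(ctx)                            # logits over the next token

class ToyNARModel(nn.Module):
    """Predicts one whole codebook layer in parallel, given earlier layers."""
    def __init__(self):
        super().__init__()
        self.code_emb = nn.Embedding(VOCAB, DIM)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, prev_layers):                      # prev_layers: (B, L, T)
        hidden = self.code_emb(prev_layers).sum(dim=1)   # (B, T, DIM)
        return self.head(hidden)                         # (B, T, VOCAB)

text_ids = torch.randint(0, 100, (1, 12))        # phoneme ids for the target text
prompt = torch.randint(0, VOCAB, (1, 30))        # first-codebook tokens of the short prompt

# Stage 1: autoregressive generation of the first codebook, prompt acting as a prefix.
ar = ToyARModel()
first_layer = prompt.clone()
for _ in range(50):                               # generate 50 new frames
    logits = ar(text_ids, first_layer)
    next_tok = logits.argmax(dim=-1, keepdim=True)
    first_layer = torch.cat([first_layer, next_tok], dim=1)

# Stage 2: non-autoregressive prediction of the remaining codebooks, one layer per pass.
nar = ToyNARModel()
layers = [first_layer]
for _ in range(N_CODEBOOKS - 1):
    stacked = torch.stack(layers, dim=1)          # (B, layers_so_far, T)
    layers.append(nar(stacked).argmax(dim=-1))

codes = torch.stack(layers, dim=1)                # (B, 8, T) -> fed to the codec decoder
print(codes.shape)
```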
Quick Start & Requirements
Dependencies include librosa, phonemizer, pypinyin, lhotse, and k2. Setup also involves cloning the icefall repository and adding it to the Python path. espeak-ng is required as the phonemizer backend (Linux/macOS), and training may require a GPU with roughly 24 GB of memory.
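As a quick sanity check after installation, the following sketch adds a locally cloned icefall checkout to the Python path and verifies that the main dependencies import; the ../icefall location is an assumption, not a layout the project mandates.

```python
# Sanity check for the environment, assuming icefall was cloned next to this repo
# (the ../icefall path below is an assumption, adjust to your checkout location).
import sys
from pathlib import Path

sys.path.insert(0, str(Path("../icefall").resolve()))

for pkg in ("torch", "torchaudio", "librosa", "phonemizer", "pypinyin", "lhotse", "k2"):
    try:
        __import__(pkg)
        print(f"{pkg}: ok")
    except ImportError as err:
        print(f"{pkg}: missing ({err})")
```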
Highlighted Details
Maintenance & Community
The repository's last recorded activity was about a month ago, and it is currently marked inactive.
Licensing & Compatibility
Limitations & Caveats
The project is an unofficial implementation and may not match the original VALL-E's reported performance. Installation is complex, requiring specific dependency versions and manual path configuration. The README notes the risk of voice-impersonation misuse, and no pre-trained models are provided.