Open source implementation of Microsoft's VALL-E X zero-shot TTS model
This repository provides an open-source implementation of Microsoft's VALL-E X, a multilingual text-to-speech (TTS) and zero-shot voice cloning model. It enables users to synthesize speech in English, Chinese, and Japanese, clone voices from short audio samples, and control speech emotion and accent.
How It Works
VALL-E X generates audio GPT-style, autoregressively predicting audio tokens quantized by EnCodec. This approach allows for efficient, high-quality speech synthesis. Key advantages include its lightweight footprint, faster inference than similar models such as Bark, and stronger performance on Chinese and Japanese, along with cross-lingual synthesis and easy voice cloning.
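To make the token-prediction idea concrete, here is a toy, self-contained sketch (not the project's actual code) of GPT-style autoregressive sampling over a discrete codec-token vocabulary. The real model additionally conditions on text and an acoustic prompt, and a codec decoder turns the predicted tokens back into a waveform.

```python
# Toy sketch of GPT-style autoregressive prediction of discrete audio tokens.
# Vocabulary size mirrors an EnCodec-style codebook (1024 entries) only conceptually;
# this is NOT the repository's model, just an illustration of the decoding loop.
import torch
import torch.nn as nn

VOCAB = 1024 + 1      # 1024 codec codes + a BOS token
BOS = 1024

class TinyARModel(nn.Module):
    def __init__(self, dim=128, heads=4, layers=2):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)
        self.head = nn.Linear(dim, VOCAB)

    def forward(self, tokens):
        # Causal mask so each position only attends to earlier tokens (GPT-style).
        n = tokens.size(1)
        mask = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)
        h = self.encoder(self.embed(tokens), mask=mask)
        return self.head(h)

@torch.no_grad()
def generate_tokens(model, steps=75):
    """Sample `steps` audio tokens one at a time, conditioning on everything so far."""
    tokens = torch.tensor([[BOS]])
    for _ in range(steps):
        logits = model(tokens)[:, -1]                  # distribution over the next token
        next_tok = torch.multinomial(logits.softmax(dim=-1), num_samples=1)
        tokens = torch.cat([tokens, next_tok], dim=1)
    return tokens[:, 1:]                               # drop BOS; these index the codec codebook

model = TinyARModel()
codes = generate_tokens(model)
print(codes.shape)    # (1, 75): roughly one second of first-codebook tokens at EnCodec's 75 Hz
```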
Quick Start & Requirements
pip install -r requirements.txt
ffmpeg is required for prompt processing and must be in the system's PATH. The model checkpoint and Whisper's medium.pt are downloaded automatically on first run, or can be placed manually in ./checkpoints/ and ./whisper/ respectively.
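A minimal Python usage sketch follows; it assumes the repository exposes preload_models and generate_audio in utils.generation and writes the result to a WAV file via SciPy. Check the repo's own examples for the exact API.

```python
# Minimal usage sketch. Assumes utils.generation provides SAMPLE_RATE, preload_models()
# and generate_audio(); verify these names against the repository's documentation.
from utils.generation import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav

preload_models()                                    # downloads checkpoints on first run
audio_array = generate_audio("Hello, this is a test of VALL-E X.")
write_wav("vallex_output.wav", SAMPLE_RATE, audio_array)
```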
Highlighted Details
The web UI can be launched with python -X utf8 launch-ui.py.
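Zero-shot voice cloning from a short reference clip can also be driven from Python. The sketch below assumes a make_prompt helper in utils.prompt_making and a prompt argument on generate_audio; these names are assumptions and may differ from the actual API.

```python
# Hedged sketch of zero-shot voice cloning. Assumes utils.prompt_making.make_prompt()
# and a `prompt=` argument on generate_audio(); adjust to the actual API if it differs.
from utils.prompt_making import make_prompt
from utils.generation import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav

preload_models()
make_prompt(name="speaker1", audio_prompt_path="sample.wav")   # short clip of the target voice
audio_array = generate_audio("Cloned voice speaking a new sentence.", prompt="speaker1")
write_wav("cloned.wav", SAMPLE_RATE, audio_array)
```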
Maintenance & Community
The project is actively maintained, with recent updates including batch decoding in the AR decoder and replacing the EnCodec decoder with Vocos. Community support is available via Discord.
Licensing & Compatibility
Licensed under the MIT License, allowing for commercial use and integration with closed-source projects.
Limitations & Caveats
Generated audio is limited to approximately 22 seconds due to the Transformer's quadratic complexity. The project does not release training code, instead referencing an external implementation. Fine-tuning for better voice adaptation and .bat scripts are listed as future TODOs.