VALL-E-X by Plachtaa

Open source implementation of Microsoft's VALL-E X zero-shot TTS model

created 2 years ago
7,906 stars

Top 6.7% on sourcepulse

Project Summary

This repository provides an open-source implementation of Microsoft's VALL-E X, a multilingual text-to-speech (TTS) and zero-shot voice cloning model. It enables users to synthesize speech in English, Chinese, and Japanese, clone voices from short audio samples, and control speech emotion and accent.

How It Works

VALL-E X generates audio in a GPT-style, autoregressive fashion, predicting audio tokens produced by the EnCodec neural codec. This approach allows for efficient, high-quality speech synthesis. Key advantages include its lightweight footprint, faster inference than similar models such as Bark, and stronger performance on Chinese and Japanese, along with cross-lingual synthesis and easy voice cloning.
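The toy sketch below (illustrative only, not the repository's code) shows the general idea: a decoder-only Transformer conditioned on text tokens samples audio tokens one at a time, and the resulting token sequence would then be decoded back to a waveform by the codec.

```python
import torch
import torch.nn as nn

CODEBOOK_SIZE = 1024   # EnCodec codebook size per quantizer
EOS = CODEBOOK_SIZE    # hypothetical end-of-audio token for this sketch

class ToyARDecoder(nn.Module):
    """Decoder-only Transformer that predicts the next audio token."""
    def __init__(self, d_model=256, n_heads=4, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(CODEBOOK_SIZE + 1, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, CODEBOOK_SIZE + 1)

    def forward(self, tokens):                       # tokens: (1, T)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.blocks(self.embed(tokens), mask=mask)   # causal self-attention
        return self.head(h[:, -1])                   # logits for the next token

@torch.no_grad()
def generate(model, text_tokens, max_new_tokens=50):
    seq = text_tokens                                # condition on the text prompt
    for _ in range(max_new_tokens):
        probs = model(seq).softmax(dim=-1)
        nxt = torch.multinomial(probs, num_samples=1)    # sample the next audio token
        if nxt.item() == EOS:
            break
        seq = torch.cat([seq, nxt], dim=1)
    return seq[:, text_tokens.size(1):]              # keep only the audio tokens

model = ToyARDecoder()
audio_tokens = generate(model, torch.randint(0, CODEBOOK_SIZE, (1, 8)))
print(audio_tokens.shape)  # these tokens would be decoded to audio by EnCodec/Vocos
```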

Quick Start & Requirements

  • Install: Clone the repository and install dependencies via pip install -r requirements.txt (a minimal usage sketch follows this list).
  • Prerequisites: Python 3.10, CUDA 11.7 ~ 12.0, PyTorch 2.0+. ffmpeg is required for prompt processing and must be in the system's PATH.
  • Models: The model checkpoint and Whisper's medium.pt will be downloaded automatically on first run or can be manually placed in ./checkpoints/ and ./whisper/ respectively.
  • Demos: Online demos are available on Hugging Face and Google Colab.
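After installation, the basic workflow looks roughly like the sketch below. It assumes the preload_models and generate_audio helpers exposed in the repository's utils.generation module, as documented in its README; exact names and signatures may vary between versions.

```python
# Minimal synthesis sketch, run from the repository root.
# Assumes the helpers documented in the project README (utils.generation);
# names and signatures may differ between versions.
from scipy.io.wavfile import write as write_wav
from utils.generation import SAMPLE_RATE, generate_audio, preload_models

preload_models()  # downloads the VALL-E X checkpoint and Whisper medium.pt on first run

text_prompt = "Hello, this is a quick VALL-E X test."
audio_array = generate_audio(text_prompt)          # NumPy waveform at SAMPLE_RATE

write_wav("vallex_test.wav", SAMPLE_RATE, audio_array)
```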

Highlighted Details

  • Supports zero-shot voice cloning from 3-10 second audio prompts (see the cloning sketch after this list).
  • Enables speech emotion and accent control.
  • Features zero-shot cross-lingual speech synthesis.
  • Maintains acoustic environment from prompts.
  • Offers a user-friendly UI (python -X utf8 launch-ui.py).
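Zero-shot cloning follows the same pattern. The sketch below assumes the make_prompt helper from utils.prompt_making described in the repository README, where a short clip is registered as a named prompt and reused at generation time; treat the argument names as approximate.

```python
# Voice-cloning sketch; assumes the make_prompt / generate_audio helpers
# described in the project README -- argument names may differ by version.
from scipy.io.wavfile import write as write_wav
from utils.generation import SAMPLE_RATE, generate_audio, preload_models
from utils.prompt_making import make_prompt

# Register a reusable voice prompt from a 3-10 second recording.
# Whisper transcribes the clip automatically when no transcript is supplied.
make_prompt(name="my_voice", audio_prompt_path="my_voice_sample.wav")

preload_models()
audio_array = generate_audio("This sentence is spoken in the cloned voice.",
                             prompt="my_voice")
write_wav("cloned_voice.wav", SAMPLE_RATE, audio_array)
```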

Maintenance & Community

The project is actively maintained; recent updates include batch decoding for the AR decoder and replacing the EnCodec decoder with Vocos for waveform reconstruction. Community support is available via Discord.

Licensing & Compatibility

Licensed under the MIT License, allowing for commercial use and integration with closed-source projects.

Limitations & Caveats

Generated audio is limited to approximately 22 seconds per generation due to the Transformer's quadratic attention complexity. Training code is not released; the README points to an external implementation instead. Fine-tuning for better voice adaptation and Windows .bat scripts are listed as future TODOs.
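Since the cap applies per generation call, a common workaround (illustrative only, not a repository feature) is to split long text into sentences, synthesize each separately, and concatenate the clips:

```python
# Illustrative workaround for the ~22 s per-call limit (not part of the repo):
# split long text into sentences, synthesize each, and concatenate the audio.
import re
import numpy as np
from scipy.io.wavfile import write as write_wav
from utils.generation import SAMPLE_RATE, generate_audio, preload_models  # assumed helpers

preload_models()

long_text = "First sentence of a long script. Second sentence. Third sentence."
sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", long_text) if s.strip()]

clips = [generate_audio(s) for s in sentences]                  # each clip stays under the limit
pause = np.zeros(int(0.3 * SAMPLE_RATE), dtype=clips[0].dtype)  # short gap between sentences
full_audio = np.concatenate([np.concatenate([c, pause]) for c in clips])

write_wav("long_output.wav", SAMPLE_RATE, full_audio)
```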

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 80 stars in the last 90 days

Explore Similar Projects

Starred by Dan Guido (Cofounder of Trail of Bits), Joe Walnes (Head of Experimental Projects at Stripe), and 1 more.

chatterbox by resemble-ai

  • Open-source TTS model
  • Top 1.6%, 10k stars
  • created 3 months ago, updated 1 day ago