VALL-E-X by Plachtaa

Open source implementation of Microsoft's VALL-E X zero-shot TTS model

Created 2 years ago
7,922 stars

Top 6.5% on SourcePulse

Project Summary

This repository provides an open-source implementation of Microsoft's VALL-E X, a multilingual text-to-speech (TTS) and zero-shot voice cloning model. It enables users to synthesize speech in English, Chinese, and Japanese, clone voices from short audio samples, and control speech emotion and accent.

How It Works

VALL-E X generates audio GPT-style, autoregressively predicting the discrete audio tokens produced by EnCodec's quantizer. Key advantages include its lightweight footprint, faster inference than comparable models such as Bark, and stronger performance on Chinese and Japanese, along with zero-shot cross-lingual synthesis and easy voice cloning.
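The decoding loop described above can be sketched in miniature. This is a toy stand-in, not the repository's code: `toy_logits`, the codebook size, and the end-of-sequence id are illustrative assumptions; the real model is a Transformer decoder conditioned on text and acoustic prompts.

```python
import numpy as np

CODEBOOK_SIZE = 1024  # EnCodec uses 1024-entry codebooks
EOS = CODEBOOK_SIZE   # assumed end-of-sequence id, for illustration only

rng = np.random.default_rng(0)

def toy_logits(text_tokens, audio_tokens):
    """Stand-in for the AR Transformer: logits over codebook entries + EOS."""
    return rng.normal(size=CODEBOOK_SIZE + 1)

def decode(text_tokens, max_steps=8):
    """GPT-style loop: predict one audio token per step, conditioned on
    the text tokens and all audio tokens generated so far."""
    audio_tokens = []
    for _ in range(max_steps):
        logits = toy_logits(text_tokens, audio_tokens)
        next_token = int(np.argmax(logits))  # greedy here; sampled in practice
        if next_token == EOS:
            break
        audio_tokens.append(next_token)
    # In the real pipeline these token ids would be fed to the EnCodec
    # (or Vocos) decoder to reconstruct a waveform.
    return audio_tokens

tokens = decode([1, 2, 3])
```

The non-autoregressive refinement of the remaining codebook levels is omitted here; the sketch only illustrates the first-codebook AR loop.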

Quick Start & Requirements

  • Install: Clone the repository and install dependencies via pip install -r requirements.txt.
  • Prerequisites: Python 3.10, CUDA 11.7 ~ 12.0, PyTorch 2.0+. ffmpeg is required for prompt processing and must be in the system's PATH.
  • Models: The model checkpoint and Whisper's medium.pt will be downloaded automatically on first run or can be manually placed in ./checkpoints/ and ./whisper/ respectively.
  • Demos: Online demos are available on Hugging Face and Google Colab.

Highlighted Details

  • Supports zero-shot voice cloning from 3-10 second audio prompts.
  • Enables speech emotion and accent control.
  • Features zero-shot cross-lingual speech synthesis.
  • Maintains acoustic environment from prompts.
  • Offers a user-friendly UI (python -X utf8 launch-ui.py).

Maintenance & Community

Recent updates added AR-decoder batch decoding and replaced the EnCodec decoder with Vocos, though the last commit is now about a year old. Community support is available via Discord.

Licensing & Compatibility

Licensed under the MIT License, allowing for commercial use and integration with closed-source projects.

Limitations & Caveats

Generated audio is limited to approximately 22 seconds, a consequence of the Transformer's quadratic attention complexity. Training code is not released; the project instead references an external implementation. Fine-tuning for better voice adaptation and .bat scripts are listed as future TODOs.
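Some back-of-envelope arithmetic behind the length cap (the 75 Hz frame rate is an assumption about EnCodec's per-codebook token rate; the project does not state these numbers explicitly):

```python
# Illustrative arithmetic, not figures from the repository.
FRAMES_PER_SECOND = 75   # assumed EnCodec token rate per codebook (24 kHz model)
MAX_SECONDS = 22         # reported generation limit

max_tokens = FRAMES_PER_SECOND * MAX_SECONDS
print(max_tokens)        # ~1650 autoregressive steps for a 22-second clip

# Self-attention compute/memory grows with the square of sequence length,
# so doubling the clip length roughly quadruples the attention cost:
cost = lambda n: n * n
print(cost(2 * max_tokens) / cost(max_tokens))  # 4.0
```

This is why extending the limit is not just a config change: longer clips would require chunking, sparse attention, or a similar architectural workaround.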

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 20 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Chaoyu Yang (Founder of Bento), and 1 more.

fish-speech by fishaudio
Top 0.3% · 23k stars
Open-source TTS for multilingual speech synthesis
Created 1 year ago · Updated 1 week ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Junyang Lin (Core Maintainer at Alibaba Qwen), and 6 more.

OpenVoice by myshell-ai
Top 0.2% · 34k stars
Audio foundation model for versatile, instant voice cloning
Created 1 year ago · Updated 5 months ago
Starred by Georgios Konstantopoulos (CTO, General Partner at Paradigm) and Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems").

GPT-SoVITS by RVC-Boss
Top 0.3% · 51k stars
Few-shot voice cloning and TTS web UI
Created 1 year ago · Updated 1 week ago