assem-vc by maum-ai

PyTorch code for any-to-many voice conversion research

created 4 years ago
267 stars

Top 96.7% on sourcepulse

Project Summary

Assem-VC is a PyTorch implementation for any-to-many non-parallel voice conversion, aiming for realistic and high-quality speech synthesis. It is designed for researchers and developers in speech processing and audio synthesis who want to build state-of-the-art voice conversion systems. The project offers a novel approach by assembling modern speech synthesis techniques, including GTA finetuning, to achieve improved speaker similarity and naturalness.

How It Works

Assem-VC uses a two-encoder-one-decoder architecture that combines the strengths of existing models. It incorporates a "Cotatron" encoder, which extracts speaker-independent linguistic features from speech, and a VC decoder that synthesizes speech in the target speaker's voice. A key innovation is GTA (ground-truth aligned) finetuning, which improves output quality and speaker similarity by finetuning the decoder on features aligned to the ground-truth utterances. This approach allows more stable alignment learning and faster convergence.
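
To make the architecture concrete, here is a minimal PyTorch sketch of the two-encoder-one-decoder idea together with a GTA-style finetuning step. The module names (LinguisticEncoder, VCDecoder), dimensions, speaker count, and the frozen-encoder detail are illustrative assumptions, not the repository's actual code or API.

    # Minimal sketch of the two-encoder-one-decoder idea; NOT the actual Assem-VC code.
    # All names and dimensions below are illustrative assumptions.
    import torch
    import torch.nn as nn

    class LinguisticEncoder(nn.Module):
        """Stands in for the Cotatron-style encoder that maps mel frames to
        speaker-independent linguistic features."""
        def __init__(self, n_mels=80, hidden=256):
            super().__init__()
            self.rnn = nn.GRU(n_mels, hidden, batch_first=True, bidirectional=True)

        def forward(self, mel):            # mel: (B, T, n_mels)
            feats, _ = self.rnn(mel)       # (B, T, 2*hidden)
            return feats

    class VCDecoder(nn.Module):
        """Decodes linguistic features plus a target-speaker embedding back to mel."""
        def __init__(self, in_dim=512, spk_dim=64, n_mels=80):
            super().__init__()
            self.proj = nn.Sequential(
                nn.Linear(in_dim + spk_dim, 256), nn.ReLU(), nn.Linear(256, n_mels)
            )

        def forward(self, feats, spk):     # spk: (B, spk_dim)
            spk = spk.unsqueeze(1).expand(-1, feats.size(1), -1)
            return self.proj(torch.cat([feats, spk], dim=-1))

    # GTA-style finetuning step (hypothetical): the encoder is frozen and the decoder
    # is trained on features aligned to the ground-truth utterance, so it sees the
    # same alignment at training time and at conversion time.
    encoder, decoder = LinguisticEncoder(), VCDecoder()
    spk_table = nn.Embedding(108, 64)      # one embedding per training speaker (any-to-many)
    opt = torch.optim.Adam(list(decoder.parameters()) + list(spk_table.parameters()), lr=1e-4)

    mel = torch.randn(4, 200, 80)          # dummy ground-truth mel spectrograms
    spk_id = torch.randint(0, 108, (4,))
    with torch.no_grad():                  # encoder frozen during this finetuning step
        feats = encoder(mel)
    pred = decoder(feats, spk_table(spk_id))
    loss = nn.functional.l1_loss(pred, mel)
    loss.backward()
    opt.step()

At conversion time the same decoder would be driven by features from an arbitrary source utterance and the embedding of any speaker seen during training, which is what "any-to-many" refers to.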

Quick Start & Requirements

  • Installation: git clone --recursive https://github.com/mindslab-ai/assem-vc
  • Prerequisites: Python 3.6.8, PyTorch 1.4.0, PyTorch Lightning 1.0.3. Datasets (LibriTTS, VCTK) must be downloaded and resampled to 22.05 kHz (see the resampling sketch after this list), and metadata files in the expected format are also required.
  • Setup: Requires downloading and preparing datasets, which can take time depending on data size and processing speed. Configuration files need to be edited for training.
  • Resources: Pre-trained models are available for download.
  • Links: Paper: https://arxiv.org/abs/2104.00931, Audio Samples: https://mindslab-ai.github.io/assem-vc/
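
The resampling step referenced above can be done in a few lines. The following sketch assumes a VCTK-style folder layout and uses torchaudio; the paths are placeholders, and the metadata format and exact preprocessing should follow the repository's configs.

    # Hedged sketch of resampling a dataset to 22.05 kHz; paths are placeholders.
    from pathlib import Path
    import torchaudio

    TARGET_SR = 22050
    src_root, dst_root = Path("VCTK-Corpus/wav48"), Path("VCTK-Corpus/wav22")

    for wav_path in src_root.rglob("*.wav"):
        wav, sr = torchaudio.load(wav_path)
        if sr != TARGET_SR:
            wav = torchaudio.transforms.Resample(sr, TARGET_SR)(wav)
        out_path = dst_root / wav_path.relative_to(src_root)
        out_path.parent.mkdir(parents=True, exist_ok=True)
        torchaudio.save(str(out_path), wav, TARGET_SR)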

Highlighted Details

  • Achieves state-of-the-art performance on the VCTK dataset for naturalness and speaker similarity.
  • Enables any-to-many voice conversion: any source speaker can be converted to any speaker seen during training.
  • Introduces GTA finetuning for improved voice conversion quality.
  • Explores speaker disentanglement of phonetic posteriorgrams (PPG).
  • Extended to singing voice decomposition and synthesis.

Maintenance & Community

The project is associated with MINDsLab Inc. and SNU. Contact information for Kang-wook Kim is provided for inquiries.

Licensing & Compatibility

  • License: BSD 3-Clause License.
  • Compatibility: Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The repository notes that result quality may differ from the paper because an open-source g2p system is used in place of the proprietary one used in the paper. Multi-GPU training may also be slow due to PyTorch Lightning version issues.

Health Check

  • Last commit: 3 years ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 1 star in the last 90 days

Explore Similar Projects

Starred by Tim J. Baek (Founder of Open WebUI), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 3 more.

StyleTTS2 by yl4579

Top 0.2% on sourcepulse · 6k stars
Text-to-speech model achieving human-level synthesis
created 2 years ago, updated 11 months ago