assem-vc by maum-ai

PyTorch code for any-to-many voice conversion research

created 4 years ago
267 stars

Top 96.7% on sourcepulse

Project Summary

Assem-VC is a PyTorch implementation for any-to-many non-parallel voice conversion, aiming for realistic and high-quality speech synthesis. It is designed for researchers and developers in speech processing and audio synthesis who want to build state-of-the-art voice conversion systems. The project offers a novel approach by assembling modern speech synthesis techniques, including GTA finetuning, to achieve improved speaker similarity and naturalness.

How It Works

Assem-VC uses a two-encoder-one-decoder architecture that combines the strengths of existing models. It incorporates a "Cotatron" encoder, which extracts speaker-independent linguistic features from speech, and a VC decoder that synthesizes speech in the target speaker's voice. A key innovation is GTA (ground-truth aligned) finetuning, which improves output quality and speaker similarity by finetuning the decoder on features aligned to the ground-truth utterances. This approach allows more stable alignment learning and faster convergence.
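
To make the architecture concrete, here is a minimal PyTorch sketch of the two-encoder-one-decoder idea together with a GTA-style finetuning step. The module names (LinguisticEncoder, VCDecoder), dimensions, speaker count, and the frozen-encoder detail are illustrative assumptions, not the repository's actual code or API.

    # Minimal sketch of the two-encoder-one-decoder idea; NOT the actual Assem-VC code.
    # All names and dimensions below are illustrative assumptions.
    import torch
    import torch.nn as nn

    class LinguisticEncoder(nn.Module):
        """Stands in for the Cotatron-style encoder that maps mel frames to
        speaker-independent linguistic features."""
        def __init__(self, n_mels=80, hidden=256):
            super().__init__()
            self.rnn = nn.GRU(n_mels, hidden, batch_first=True, bidirectional=True)

        def forward(self, mel):            # mel: (B, T, n_mels)
            feats, _ = self.rnn(mel)       # (B, T, 2*hidden)
            return feats

    class VCDecoder(nn.Module):
        """Decodes linguistic features plus a target-speaker embedding back to mel."""
        def __init__(self, in_dim=512, spk_dim=64, n_mels=80):
            super().__init__()
            self.proj = nn.Sequential(
                nn.Linear(in_dim + spk_dim, 256), nn.ReLU(), nn.Linear(256, n_mels)
            )

        def forward(self, feats, spk):     # spk: (B, spk_dim)
            spk = spk.unsqueeze(1).expand(-1, feats.size(1), -1)
            return self.proj(torch.cat([feats, spk], dim=-1))

    # GTA-style finetuning step (hypothetical): the encoder is frozen and the decoder
    # is trained on features aligned to the ground-truth utterance, so it sees the
    # same alignment at training time and at conversion time.
    encoder, decoder = LinguisticEncoder(), VCDecoder()
    spk_table = nn.Embedding(108, 64)      # one embedding per training speaker (any-to-many)
    opt = torch.optim.Adam(list(decoder.parameters()) + list(spk_table.parameters()), lr=1e-4)

    mel = torch.randn(4, 200, 80)          # dummy ground-truth mel spectrograms
    spk_id = torch.randint(0, 108, (4,))
    with torch.no_grad():                  # encoder frozen during this finetuning step
        feats = encoder(mel)
    pred = decoder(feats, spk_table(spk_id))
    loss = nn.functional.l1_loss(pred, mel)
    loss.backward()
    opt.step()

At conversion time the same decoder would be driven by features from an arbitrary source utterance and the embedding of any speaker seen during training, which is what "any-to-many" refers to.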

Quick Start & Requirements

  • Installation: git clone --recursive https://github.com/mindslab-ai/assem-vc
  • Prerequisites: Python 3.6.8, PyTorch 1.4.0, PyTorch Lightning 1.0.3. Datasets (LibriTTS, VCTK) must be downloaded and resampled to 22.05 kHz (see the resampling sketch after this list), and metadata files in the expected format are also required.
  • Setup: Requires downloading and preparing datasets, which can take time depending on data size and processing speed. Configuration files need to be edited for training.
  • Resources: Pre-trained models are available for download.
  • Links: Paper: https://arxiv.org/abs/2104.00931, Audio Samples: https://mindslab-ai.github.io/assem-vc/
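
The resampling step referenced above can be done in a few lines. The following sketch assumes a VCTK-style folder layout and uses torchaudio; the paths are placeholders, and the metadata format and exact preprocessing should follow the repository's configs.

    # Hedged sketch of resampling a dataset to 22.05 kHz; paths are placeholders.
    from pathlib import Path
    import torchaudio

    TARGET_SR = 22050
    src_root, dst_root = Path("VCTK-Corpus/wav48"), Path("VCTK-Corpus/wav22")

    for wav_path in src_root.rglob("*.wav"):
        wav, sr = torchaudio.load(wav_path)
        if sr != TARGET_SR:
            wav = torchaudio.transforms.Resample(sr, TARGET_SR)(wav)
        out_path = dst_root / wav_path.relative_to(src_root)
        out_path.parent.mkdir(parents=True, exist_ok=True)
        torchaudio.save(str(out_path), wav, TARGET_SR)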

Highlighted Details

  • Achieves state-of-the-art performance on the VCTK dataset for naturalness and speaker similarity.
  • Enables any-to-many voice conversion: any source speaker can be converted to any speaker seen during training.
  • Introduces GTA finetuning for improved voice conversion quality.
  • Explores speaker disentanglement of phonetic posteriorgrams (PPG).
  • Extended to singing voice decomposition and synthesis.

Maintenance & Community

The project is associated with MINDsLab Inc. and SNU. Contact information for Kang-wook Kim is provided for inquiries.

Licensing & Compatibility

  • License: BSD 3-Clause License.
  • Compatibility: Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The repository notes that result quality may differ from the paper because an open-source g2p system is used in place of the proprietary one used in the paper. Multi-GPU training may also be slow due to PyTorch Lightning version issues.

Health Check

  • Last commit: 3 years ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 1 star in the last 90 days

Explore Similar Projects

Starred by Tim J. Baek (Founder of Open WebUI), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 3 more.

StyleTTS2 by yl4579

Top 0.2% on sourcepulse · 6k stars
Text-to-speech model achieving human-level synthesis
created 2 years ago, updated 11 months ago