assem-vc by maum-ai

PyTorch code for any-to-many voice conversion research

Created 4 years ago
267 stars

Top 95.9% on SourcePulse

Project Summary

Assem-VC is a PyTorch implementation for any-to-many non-parallel voice conversion, aiming for realistic and high-quality speech synthesis. It is designed for researchers and developers in speech processing and audio synthesis who want to build state-of-the-art voice conversion systems. The project offers a novel approach by assembling modern speech synthesis techniques, including GTA finetuning, to achieve improved speaker similarity and naturalness.

How It Works

Assem-VC uses a two-encoder, one-decoder architecture that combines the strengths of existing models. A Cotatron encoder, guided by the transcript, extracts speaker-disentangled linguistic features from the source speech, and a VC decoder renders those features in the target speaker's voice. A key innovation is GTA (ground-truth aligned) finetuning, which significantly enhances output quality and speaker similarity by training on outputs generated under ground-truth alignment. This approach also makes alignment learning more stable and convergence faster.
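
To make the data flow concrete, here is a minimal PyTorch sketch of the two-encoder-one-decoder idea. All module names, shapes, and the embedding-table speaker encoder are illustrative assumptions, not the repository's actual classes (the real model also conditions on pitch, learns attention alignments, and vocodes with HiFi-GAN):

    import torch
    import torch.nn as nn

    # Hypothetical sketch for illustration; not the repository's actual API.
    class AssemVCSketch(nn.Module):
        def __init__(self, n_mels=80, n_symbols=100, n_speakers=110, d=256):
            super().__init__()
            # Encoder 1: transcription-guided linguistic encoder (Cotatron-style).
            self.text_emb = nn.Embedding(n_symbols, d)
            self.ling_enc = nn.GRU(n_mels + d, d, batch_first=True)
            # Encoder 2: target-speaker lookup (any-to-many: training speakers only).
            self.spk_emb = nn.Embedding(n_speakers, d)
            # Decoder: linguistic features + speaker embedding -> converted mel.
            self.decoder = nn.GRU(2 * d, d, batch_first=True)
            self.mel_out = nn.Linear(d, n_mels)

        def forward(self, mel, text_ids, target_spk):
            # Crude text conditioning; the real model learns attention alignments.
            txt = self.text_emb(text_ids).mean(1, keepdim=True).expand(-1, mel.size(1), -1)
            ling, _ = self.ling_enc(torch.cat([mel, txt], dim=-1))      # (B, T, d)
            spk = self.spk_emb(target_spk)[:, None].expand(-1, mel.size(1), -1)
            dec, _ = self.decoder(torch.cat([ling, spk], dim=-1))
            return self.mel_out(dec)                                    # (B, T, n_mels)

    model = AssemVCSketch()
    mel = torch.randn(2, 120, 80)                     # source mel-spectrogram
    out = model(mel, torch.randint(0, 100, (2, 40)),  # phoneme/text ids
                torch.tensor([3, 7]))                 # target speaker ids
    print(out.shape)                                  # torch.Size([2, 120, 80])

The any-to-many property falls out of the speaker lookup table: any source utterance can be encoded, but only speakers seen during training have an embedding to convert into.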

Quick Start & Requirements

  • Installation: git clone --recursive https://github.com/mindslab-ai/assem-vc
  • Prerequisites: Python 3.6.8, PyTorch 1.4.0, PyTorch Lightning 1.0.3. Datasets (LibriTTS, VCTK) must be downloaded and resampled to 22.05 kHz (a resampling sketch follows this list). Metadata files in a specific format are also required.
  • Setup: Downloading and preparing the datasets can take time depending on data size and processing speed; configuration files must be edited before training.
  • Resources: Pre-trained models are available for download.
  • Links: Paper: https://arxiv.org/abs/2104.00931, Audio Samples: https://mindslab-ai.github.io/assem-vc/
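
As a concrete example of the resampling step mentioned in the prerequisites, here is a minimal sketch using torchaudio. Paths are placeholders, and the repository ships its own preprocessing scripts and metadata format:

    import torchaudio

    TARGET_SR = 22050  # the sample rate the project expects

    def resample_wav(src_path: str, dst_path: str) -> None:
        # Load, resample if needed, and write back out.
        wav, sr = torchaudio.load(src_path)  # (channels, samples)
        if sr != TARGET_SR:
            wav = torchaudio.transforms.Resample(orig_freq=sr, new_freq=TARGET_SR)(wav)
        torchaudio.save(dst_path, wav, TARGET_SR)

    resample_wav("LibriTTS/.../utt.wav", "data/utt_22k.wav")  # placeholder paths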

Highlighted Details

  • Achieves state-of-the-art naturalness and speaker similarity on the VCTK dataset.
  • Enables any-to-many voice conversion: any source speaker can be converted to any speaker seen in the training set.
  • Introduces GTA finetuning for improved voice conversion quality (see the training-step sketch after this list).
  • Explores speaker disentanglement of phonetic posteriorgrams (PPGs).
  • Has been extended to singing voice decomposition and synthesis.
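
The sketch referenced above: a rough GTA finetuning step, assuming the common recipe of finetuning the vocoder on mels generated under ground-truth alignment (teacher forcing). vc_model, vocoder, and mel_fn are hypothetical stand-ins, not the repository's API:

    import torch
    import torch.nn.functional as F

    def gta_finetune_step(vc_model, vocoder, optimizer, batch, mel_fn):
        # `vc_model`, `vocoder`, and `mel_fn` are illustrative stand-ins.
        wav, mel, text_ids, speaker_id = batch
        with torch.no_grad():
            # Reconstruct the utterance under ground-truth alignment (teacher
            # forcing) so the generated mel stays time-aligned with the waveform.
            gta_mel = vc_model(mel, text_ids, speaker_id)
        wav_hat = vocoder(gta_mel)  # vocode the GTA mel, not the real mel
        # Simplified reconstruction loss; HiFi-GAN finetuning also uses
        # adversarial and feature-matching terms.
        loss = F.l1_loss(mel_fn(wav_hat), mel_fn(wav))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

The usual motivation is that the vocoder then learns the synthesis model's actual output distribution rather than clean ground-truth mels, which is what improves perceived quality and speaker similarity.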

Maintenance & Community

The project is associated with MINDsLab Inc. and SNU. Contact information for Kang-wook Kim is provided for inquiries.

Licensing & Compatibility

  • License: BSD 3-Clause License.
  • Compatibility: Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The repository notes that result quality may differ from the paper because an open-source g2p system is used in place of the proprietary one used internally. Multi-GPU training may also be slow due to issues with the pinned PyTorch Lightning version.

Health Check

  • Last Commit: 3 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days

Starred by Christian Laforte (Distinguished Engineer at NVIDIA; Former CTO at Stability AI), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 1 more.

Explore Similar Projects

Amphion by open-mmlab

Top 0.2% on SourcePulse · 9k stars
Toolkit for audio, music, and speech generation research
Created 1 year ago · Updated 3 months ago
Starred by Georgios Konstantopoulos (CTO, General Partner at Paradigm) and Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems").

GPT-SoVITS by RVC-Boss

Top 0.3% on SourcePulse · 51k stars
Few-shot voice cloning and TTS web UI
Created 1 year ago · Updated 1 week ago