PyTorch code for any-to-many voice conversion research
Assem-VC is a PyTorch implementation for any-to-many non-parallel voice conversion, aiming for realistic and high-quality speech synthesis. It is designed for researchers and developers in speech processing and audio synthesis who want to build state-of-the-art voice conversion systems. The project offers a novel approach by assembling modern speech synthesis techniques, including GTA finetuning, to achieve improved speaker similarity and naturalness.
How It Works
Assem-VC utilizes a two-encoder-one-decoder architecture that combines the strengths of existing models. It incorporates a "Cotatron" component for linguistic feature extraction and a VC decoder for voice conversion. A key innovation is the introduction of GTA (ground-truth aligned) finetuning: the acoustic model is run in teacher-forced mode so that its outputs stay time-aligned with the ground-truth audio, and downstream components are finetuned on those outputs. This significantly improves output quality and speaker similarity, and allows for more stable alignment learning and faster convergence.
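The idea behind GTA finetuning can be sketched in a few lines of PyTorch. This is a minimal, hypothetical illustration, not the Assem-VC code: the module names, shapes, and the stand-in "vocoder" are all assumptions made for clarity.

```python
# Hedged sketch of GTA (ground-truth aligned) finetuning. The acoustic
# decoder is run in teacher-forced mode, so the mels it emits remain
# time-aligned with the ground-truth audio; a downstream module (here a
# toy stand-in for a vocoder) is then finetuned on those generated mels
# rather than on the real ones, closing the train/inference mismatch.
# All names here are illustrative, not the Assem-VC API.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDecoder(nn.Module):
    """Stand-in for an acoustic decoder (hypothetical)."""
    def __init__(self, n_mels: int = 80):
        super().__init__()
        self.proj = nn.Linear(n_mels, n_mels)

    def forward(self, prev_mel_frames: torch.Tensor) -> torch.Tensor:
        # Teacher forcing: condition on the *ground-truth* previous
        # frames, so predictions stay aligned with the target audio.
        return self.proj(prev_mel_frames)

decoder = ToyDecoder()
vocoder = nn.Linear(80, 1)                 # toy stand-in vocoder
opt = torch.optim.Adam(vocoder.parameters(), lr=1e-3)

gt_mel = torch.randn(4, 100, 80)           # (batch, frames, n_mels)
with torch.no_grad():                      # acoustic model stays frozen
    gta_mel = decoder(gt_mel)              # GTA mels, aligned to audio

# Finetune the downstream module on GTA mels instead of ground truth.
pred = vocoder(gta_mel)
loss = F.l1_loss(pred, torch.zeros_like(pred))
loss.backward()
opt.step()
```

The key point is the `torch.no_grad()` teacher-forced pass: because the decoder sees the real previous frames, its outputs keep the ground-truth timing, which is what makes them usable as finetuning targets.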
Quick Start & Requirements
git clone --recursive https://github.com/mindslab-ai/assem-vc
Highlighted Details
Maintenance & Community
The project is associated with MINDsLab Inc. and Seoul National University (SNU). Contact information for Kang-wook Kim is provided for inquiries.
Licensing & Compatibility
Limitations & Caveats
The repository notes that result quality may differ from the paper because an open-source g2p (grapheme-to-phoneme) system is used in place of the proprietary one mentioned there. Multi-GPU training may also be slow due to PyTorch Lightning version issues.