StarGANv2-VC by yl4579

Voice conversion research paper using StarGAN v2

created 4 years ago
507 stars

Project Summary

This repository provides StarGANv2-VC, an unsupervised, non-parallel framework for diverse voice conversion. It enables many-to-many voice conversion, cross-lingual conversion, and stylistic speech conversion (e.g., emotional, falsetto) without requiring paired data or text labels. The target audience includes researchers and developers working on speech synthesis and voice manipulation.

How It Works

StarGANv2-VC leverages a generative adversarial network (GAN) architecture. It employs an adversarial source classifier loss and a perceptual loss to achieve natural-sounding voice conversion. A key component is the style encoder, which allows for the conversion of plain speech into various styles, enhancing the model's versatility. This approach enables high-quality conversion that rivals state-of-the-art text-to-speech (TTS) systems, even in real-time with compatible vocoders.
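The training objective described above combines several loss terms. The sketch below shows how such a weighted sum might look; the term names, weights, and function signature are illustrative assumptions, not the repository's actual code or the paper's hyperparameters.

```python
# Minimal sketch of how StarGANv2-VC-style generator losses could combine.
# All names and weights are illustrative assumptions, not the repo's code.
def total_generator_loss(adv, adv_src_cls, perceptual, style_recon, cycle,
                         lambda_cls=0.5, lambda_asr=1.0,
                         lambda_sty=1.0, lambda_cyc=1.0):
    """Weighted sum of generator loss terms: adversarial, adversarial
    source-classifier, ASR-based perceptual, style reconstruction,
    and cycle consistency."""
    return (adv
            + lambda_cls * adv_src_cls
            + lambda_asr * perceptual
            + lambda_sty * style_recon
            + lambda_cyc * cycle)
```

In the real model each argument would be a tensor computed from the generator, discriminator, style encoder, and a pretrained ASR network; here they are scalars purely to show the weighting structure.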

Quick Start & Requirements

  • Install: pip install SoundFile torchaudio munch parallel_wavegan pydub pyyaml click librosa
  • Prerequisites: Python >= 3.7, VCTK dataset (downsampled to 24 kHz). Pretrained ASR and F0 models are provided but may require retraining for non-English or non-speech data.
  • Setup: Clone the repository, prepare the VCTK dataset, and configure config.yml.
  • Links: Paper: https://arxiv.org/abs/2107.10394, Audio samples: https://starganv2-vc.github.io/
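Before training, config.yml points the scripts at the dataset and model settings. The fragment below is only an illustrative sketch of what such a file might contain; the key names are assumptions and may differ from the repository's actual schema, so consult the shipped config.yml.

```yaml
# Illustrative sketch only — key names are assumptions, not the repo's schema.
log_dir: "Models/VCTK"
batch_size: 5            # README notes ~10 GB of GPU RAM at this setting
device: "cuda"
epochs: 150
preprocess_params:
  sr: 24000              # VCTK downsampled to 24 kHz, per the README
```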

Highlighted Details

  • Awarded INTERSPEECH 2021 Best Paper Award.
  • Achieves natural-sounding voices comparable to TTS-based methods.
  • Supports real-time voice conversion with vocoders like Parallel WaveGAN.
  • Generalizes to any-to-many, cross-lingual, and singing conversion tasks.

Maintenance & Community

The project is associated with Yinghao Aaron Li and Nima Mesgarani. Further details on community or roadmap are not explicitly provided in the README.

Licensing & Compatibility

The repository's licensing is not explicitly stated in the README. Compatibility for commercial use or closed-source linking would require clarification.

Limitations & Caveats

While the provided ASR model works for other languages, retraining custom ASR and F0 models is recommended for optimal performance on non-English or non-speech data. The README notes that batch_size = 5 requires approximately 10GB of GPU RAM, indicating a significant hardware requirement for training.
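As a rough capacity check, the README's figure (batch_size = 5 at roughly 10 GB) implies about 2 GB of GPU RAM per training sample. The helper below extrapolates linearly; linearity is an assumption, since real memory use includes fixed overhead and may not scale exactly with batch size.

```python
# Rough linear extrapolation from the README's "batch_size = 5 ≈ 10 GB" figure.
# Linear scaling is an assumption; actual GPU memory use has fixed overhead.
def estimated_max_batch_size(gpu_ram_gb, ref_ram_gb=10.0, ref_batch=5):
    per_sample_gb = ref_ram_gb / ref_batch   # ≈ 2 GB per training sample
    return int(gpu_ram_gb // per_sample_gb)
```

Under this assumption, a 24 GB card would fit roughly batch_size = 12.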

Health Check

  • Last commit: 6 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 8 stars in the last 90 days
