emotional-vits by innnky

VITS-based speech synthesis for controllable emotion without transcript labels

Created 3 years ago
1,391 stars

Top 29.1% on SourcePulse

View on GitHub
Project Summary

This project provides an emotion-controllable text-to-speech (TTS) model based on VITS, designed for researchers and developers working with TTS systems who want to add emotional expressiveness without requiring manually labeled emotional data. It allows for nuanced emotional control by leveraging an emotion embedding extracted from reference audio.

How It Works

The model modifies the VITS architecture by incorporating an emotion embedding into the TextEncoder. Instead of relying on explicit emotion labels, it extracts an embedding from a reference audio clip provided during inference. This embedding captures the emotional characteristics of the reference audio, which the model then uses to synthesize speech with a similar emotional tone. This approach allows for a continuous emotional space, theoretically supporting any emotion present in the training data without predefined categories.
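
As a rough illustration, a VITS-style inference call conditioned on such an embedding might look like the sketch below. SynthesizerTrn, text_to_sequence, and utils.load_checkpoint follow upstream VITS conventions; the extra emo keyword, the config and checkpoint paths, and the embedding file name are assumptions about this fork rather than its documented API.

```python
# Minimal sketch of VITS-style inference conditioned on a reference-audio
# emotion embedding. SynthesizerTrn, text_to_sequence and utils.load_checkpoint
# follow upstream VITS conventions; the `emo` keyword, config path, checkpoint
# path and embedding file name are assumptions about this fork.
import torch
import utils
from models import SynthesizerTrn
from text import text_to_sequence
from text.symbols import symbols

hps = utils.get_hparams_from_file("configs/nene.json")            # assumed config path
net_g = SynthesizerTrn(
    len(symbols),
    hps.data.filter_length // 2 + 1,
    hps.train.segment_size // hps.data.hop_length,
    **hps.model).eval()
utils.load_checkpoint("logs/nene/G_latest.pth", net_g, None)      # assumed checkpoint path

# Phoneme IDs for the text to synthesize.
text_ids = torch.LongTensor(
    text_to_sequence("こんにちは。", hps.data.text_cleaners)).unsqueeze(0)
text_lengths = torch.LongTensor([text_ids.size(1)])

# Emotion embedding precomputed from a reference clip (see Quick Start below).
emo = torch.load("ref_clip.emo.pt").float().unsqueeze(0)          # assumed file name

with torch.no_grad():
    audio = net_g.infer(text_ids, text_lengths, emo=emo,
                        noise_scale=0.667, length_scale=1.0)[0][0, 0]
```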

Quick Start & Requirements

  • Install: Clone the repository and install Python requirements from requirements.txt.
  • Prerequisites: Python >= 3.6; Cython (to build the Monotonic Alignment Search extension); and a TTS dataset, e.g. the Japanese "nene" corpus or your own.
  • Setup: Custom datasets require preprocessing, including phoneme extraction and per-clip emotion embedding generation via emotion_extract.py (a stand-in extraction sketch follows this list).
  • Links: bilibili demo (linked via ↑↑↑ in README)
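
The README delegates embedding generation to emotion_extract.py. The following is a minimal stand-in sketch of that step, assuming a generic pretrained speech encoder from torchaudio rather than the repo's actual emotion model; the file names and output format are hypothetical.

```python
# Stand-in sketch of per-utterance embedding extraction, mirroring the role of
# emotion_extract.py: encode a reference wav with a pretrained speech encoder
# and mean-pool over time to get one fixed-size vector per clip. The actual
# script may use a different pretrained emotion model and output format.
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_BASE
model = bundle.get_model().eval()

def extract_embedding(wav_path: str) -> torch.Tensor:
    wav, sr = torchaudio.load(wav_path)
    wav = torchaudio.functional.resample(wav, sr, bundle.sample_rate)
    wav = wav.mean(dim=0, keepdim=True)            # mix down to mono
    with torch.no_grad():
        features, _ = model.extract_features(wav)
    return features[-1].mean(dim=1).squeeze(0)      # mean-pool last layer over time

emb = extract_embedding("ref_clip.wav")             # hypothetical file name
torch.save(emb, "ref_clip.emo.pt")
```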

Highlighted Details

  • Enables emotion control on any standard TTS dataset without manual emotion labeling.
  • Leverages a continuous emotional embedding space, theoretically supporting unlimited emotions.
  • Emotion embeddings can be clustered to help identify distinct emotional categories for single-character models (see the clustering sketch after this list).
  • Inference requires a reference audio clip to guide the emotional output.
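
A hedged sketch of that clustering step is shown below: group precomputed emotion embeddings with k-means so representative clips can be picked per cluster. The file layout and the choice of k are assumptions, not the repo's documented procedure.

```python
# Group precomputed emotion embeddings into rough categories with k-means.
# Embedding file layout and n_clusters are assumptions for illustration.
import glob
import numpy as np
from sklearn.cluster import KMeans

paths = sorted(glob.glob("dataset/*.emo.npy"))       # assumed embedding files
embs = np.stack([np.load(p) for p in paths])

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(embs)
for cluster_id in range(kmeans.n_clusters):
    members = [p for p, lbl in zip(paths, kmeans.labels_) if lbl == cluster_id]
    print(f"cluster {cluster_id}: {len(members)} clips, e.g. {members[:3]}")
```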

Maintenance & Community

No specific information on contributors, sponsorships, or community channels (like Discord/Slack) is provided in the README.

Licensing & Compatibility

The README does not explicitly state the license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The model requires a reference audio clip (or a precomputed embedding from one) at every inference call to specify the desired emotion; it does not natively understand emotion labels such as "excited" or "calm." Manually mapping such labels to reference embeddings can be cumbersome, especially for multi-character models where emotional nuances may differ per character. One hypothetical workaround is sketched below.
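
The sketch keeps a small manual table from human-readable emotion names to precomputed reference embeddings, nested per character. The names, paths, and nesting are hypothetical, not part of the project.

```python
# Illustrative workaround for the label-mapping caveat: a manual table from
# emotion names to precomputed reference embeddings, kept per character.
# All names and paths below are hypothetical.
import torch

EMOTION_BANK = {
    "character_a": {
        "calm":    "embeddings/a_calm.emo.pt",
        "excited": "embeddings/a_excited.emo.pt",
    },
    "character_b": {
        "calm":    "embeddings/b_calm.emo.pt",
        "angry":   "embeddings/b_angry.emo.pt",
    },
}

def lookup_emotion(character: str, emotion: str) -> torch.Tensor:
    """Load the reference embedding registered for this character/emotion pair."""
    return torch.load(EMOTION_BANK[character][emotion])
```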

Health Check

  • Last Commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days
