emotional-vits by innnky

VITS-based speech synthesis for controllable emotion without transcript labels

Created 3 years ago
1,391 stars

Top 29.1% on SourcePulse

View on GitHub
Project Summary

This project provides an emotion-controllable text-to-speech (TTS) model based on VITS, designed for researchers and developers working with TTS systems who want to add emotional expressiveness without requiring manually labeled emotional data. It allows for nuanced emotional control by leveraging an emotion embedding extracted from reference audio.

How It Works

The model modifies the VITS architecture by incorporating an emotion embedding into the TextEncoder. Instead of relying on explicit emotion labels, it extracts an embedding from a reference audio clip provided during inference. This embedding captures the emotional characteristics of the reference audio, which the model then uses to synthesize speech with a similar emotional tone. This approach allows for a continuous emotional space, theoretically supporting any emotion present in the training data without predefined categories.
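
As a rough illustration, a VITS-style inference call conditioned on such an embedding might look like the sketch below. SynthesizerTrn, text_to_sequence, and utils.load_checkpoint follow upstream VITS conventions; the extra emo keyword, the config and checkpoint paths, and the embedding file name are assumptions about this fork rather than its documented API.

```python
# Minimal sketch of VITS-style inference conditioned on a reference-audio
# emotion embedding. SynthesizerTrn, text_to_sequence and utils.load_checkpoint
# follow upstream VITS conventions; the `emo` keyword, config path, checkpoint
# path and embedding file name are assumptions about this fork.
import torch
import utils
from models import SynthesizerTrn
from text import text_to_sequence
from text.symbols import symbols

hps = utils.get_hparams_from_file("configs/nene.json")            # assumed config path
net_g = SynthesizerTrn(
    len(symbols),
    hps.data.filter_length // 2 + 1,
    hps.train.segment_size // hps.data.hop_length,
    **hps.model).eval()
utils.load_checkpoint("logs/nene/G_latest.pth", net_g, None)      # assumed checkpoint path

# Phoneme IDs for the text to synthesize.
text_ids = torch.LongTensor(
    text_to_sequence("こんにちは。", hps.data.text_cleaners)).unsqueeze(0)
text_lengths = torch.LongTensor([text_ids.size(1)])

# Emotion embedding precomputed from a reference clip (see Quick Start below).
emo = torch.load("ref_clip.emo.pt").float().unsqueeze(0)          # assumed file name

with torch.no_grad():
    audio = net_g.infer(text_ids, text_lengths, emo=emo,
                        noise_scale=0.667, length_scale=1.0)[0][0, 0]
```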

Quick Start & Requirements

  • Install: Clone the repository and install Python requirements from requirements.txt.
  • Prerequisites: Python >= 3.6; Cython (to build the Monotonic Alignment Search extension); and a TTS dataset, e.g. the Japanese "nene" corpus or your own.
  • Setup: Custom datasets require preprocessing, including phoneme extraction and per-clip emotion embedding generation via emotion_extract.py (a stand-in extraction sketch follows this list).
  • Links: bilibili demo (linked via ↑↑↑ in README)
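
The README delegates embedding generation to emotion_extract.py. The following is a minimal stand-in sketch of that step, assuming a generic pretrained speech encoder from torchaudio rather than the repo's actual emotion model; the file names and output format are hypothetical.

```python
# Stand-in sketch of per-utterance embedding extraction, mirroring the role of
# emotion_extract.py: encode a reference wav with a pretrained speech encoder
# and mean-pool over time to get one fixed-size vector per clip. The actual
# script may use a different pretrained emotion model and output format.
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_BASE
model = bundle.get_model().eval()

def extract_embedding(wav_path: str) -> torch.Tensor:
    wav, sr = torchaudio.load(wav_path)
    wav = torchaudio.functional.resample(wav, sr, bundle.sample_rate)
    wav = wav.mean(dim=0, keepdim=True)            # mix down to mono
    with torch.no_grad():
        features, _ = model.extract_features(wav)
    return features[-1].mean(dim=1).squeeze(0)      # mean-pool last layer over time

emb = extract_embedding("ref_clip.wav")             # hypothetical file name
torch.save(emb, "ref_clip.emo.pt")
```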

Highlighted Details

  • Enables emotion control on any standard TTS dataset without manual emotion labeling.
  • Leverages a continuous emotional embedding space, theoretically supporting unlimited emotions.
  • Emotion embeddings can be clustered to help identify distinct emotional categories for single-character models (see the clustering sketch after this list).
  • Inference requires a reference audio clip to guide the emotional output.
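
A hedged sketch of that clustering step is shown below: group precomputed emotion embeddings with k-means so representative clips can be picked per cluster. The file layout and the choice of k are assumptions, not the repo's documented procedure.

```python
# Group precomputed emotion embeddings into rough categories with k-means.
# Embedding file layout and n_clusters are assumptions for illustration.
import glob
import numpy as np
from sklearn.cluster import KMeans

paths = sorted(glob.glob("dataset/*.emo.npy"))       # assumed embedding files
embs = np.stack([np.load(p) for p in paths])

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(embs)
for cluster_id in range(kmeans.n_clusters):
    members = [p for p, lbl in zip(paths, kmeans.labels_) if lbl == cluster_id]
    print(f"cluster {cluster_id}: {len(members)} clips, e.g. {members[:3]}")
```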

Maintenance & Community

No specific information on contributors, sponsorships, or community channels (like Discord/Slack) is provided in the README.

Licensing & Compatibility

The README does not explicitly state the license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The model requires a reference audio clip (or a precomputed embedding from one) at every inference call to specify the desired emotion; it does not natively understand emotion labels such as "excited" or "calm." Manually mapping such labels to reference embeddings can be cumbersome, especially for multi-character models where emotional nuances may differ per character. One hypothetical workaround is sketched below.
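
The sketch keeps a small manual table from human-readable emotion names to precomputed reference embeddings, nested per character. The names, paths, and nesting are hypothetical, not part of the project.

```python
# Illustrative workaround for the label-mapping caveat: a manual table from
# emotion names to precomputed reference embeddings, kept per character.
# All names and paths below are hypothetical.
import torch

EMOTION_BANK = {
    "character_a": {
        "calm":    "embeddings/a_calm.emo.pt",
        "excited": "embeddings/a_excited.emo.pt",
    },
    "character_b": {
        "calm":    "embeddings/b_calm.emo.pt",
        "angry":   "embeddings/b_angry.emo.pt",
    },
}

def lookup_emotion(character: str, emotion: str) -> torch.Tensor:
    """Load the reference embedding registered for this character/emotion pair."""
    return torch.load(EMOTION_BANK[character][emotion])
```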

Health Check

  • Last Commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days
