Marco-Voice by AIDC-AI

Unified framework for expressive speech synthesis

Created 1 month ago
370 stars

Top 76.3% on SourcePulse

View on GitHub
Project Summary

Marco-Voice is a unified framework for expressive speech synthesis, offering voice cloning, emotion control, and cross-lingual capabilities. It aims to generate highly expressive, controllable, and natural speech that preserves speaker identity across diverse linguistic and emotional contexts. The target audience includes researchers and developers in speech synthesis and human-computer interaction, with the primary benefit being advanced control over synthesized speech characteristics.

How It Works

Marco-Voice employs a speaker-emotion disentanglement mechanism, utilizing in-batch contrastive learning to separate speaker identity from emotional style. A rotational emotion embedding integration method allows for smooth emotion control. A cross-attention mechanism further integrates emotional information with linguistic content during generation. This approach enables independent manipulation of speaker identity and emotional expression, leading to more nuanced and controllable speech synthesis.
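
A minimal PyTorch sketch may make the first two mechanisms concrete. This is not the repository's implementation: the function names, tensor shapes, and the RoPE-style pairwise rotation are illustrative assumptions based on the description above.

```python
import torch
import torch.nn.functional as F

def inbatch_contrastive_loss(spk_emb, speaker_ids, temperature=0.1):
    """In-batch contrastive objective for disentanglement: utterances of the
    same speaker (under any emotion) are positives, so the speaker embedding
    is pushed to be emotion-invariant; other speakers act as negatives."""
    z = F.normalize(spk_emb, dim=-1)                    # (B, D)
    sim = z @ z.T / temperature                         # (B, B) similarities
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))     # drop self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos = (speaker_ids[:, None] == speaker_ids[None, :]) & ~self_mask
    n_pos = pos.sum(1)
    keep = n_pos > 0                                    # anchors with a positive
    loss = -log_prob.masked_fill(~pos, 0.0).sum(1)[keep] / n_pos[keep]
    return loss.mean()

def rotate_emotion(spk_emb, angles):
    """Rotational emotion integration: rotate (even, odd) dimension pairs of
    the embedding by emotion-dependent angles; scaling the angles gives a
    smooth handle on emotion intensity."""
    x1, x2 = spk_emb[..., 0::2], spk_emb[..., 1::2]     # (B, D/2) each
    cos, sin = angles.cos(), angles.sin()               # angles: (B, D/2)
    out = torch.empty_like(spk_emb)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```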

Quick Start & Requirements

  • Installation: Requires Conda environment setup (conda create -n marco python=3.8, conda activate marco), cloning the repository, and installing requirements (pip install -r requirements.txt).
  • Prerequisites: Python 3.8, Conda. Training requires 8x NVIDIA A100 (80GB) GPUs; a quick environment check is sketched after this list.
  • Data Preparation: Scripts are provided for preparing custom datasets and integrating open-source data.
  • Links: GitHub, Paper, Hugging Face Datasets.
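
The following snippet (a sketch, not part of the repository; it assumes PyTorch is installed via requirements.txt) checks that the local setup matches the prerequisites above:

```python
import sys
import torch

# Warn rather than fail: the repository targets Python 3.8.
if sys.version_info[:2] != (3, 8):
    print(f"Warning: Python {sys.version_info.major}.{sys.version_info.minor} "
          "detected; the repo targets 3.8.")

print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPUs visible:   {torch.cuda.device_count()} "
      "(the reference training config uses 8x A100 80GB)")
for i in range(torch.cuda.device_count()):
    p = torch.cuda.get_device_properties(i)
    print(f"  [{i}] {p.name}, {p.total_memory / 2**30:.0f} GiB")
```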

Highlighted Details

  • Achieves superior speech clarity, naturalness, and speaker similarity compared with baseline systems such as CosyVoice1 and CosyVoice2, as validated by both human evaluations and objective metrics.
  • Introduces the CSEMOTIONS dataset, featuring 10.2 hours of Mandarin speech across seven emotional categories from professional voice actors, along with evaluation prompts in English and Chinese.
  • Supports cross-lingual emotion transfer, enabling the application of emotional styles across different languages.
  • Integrates ASR models (Whisper-large-v3 for English, Paraformer-zh for Mandarin) to compute Word Error Rate (WER) during evaluation; see the sketch after this list.
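
A minimal sketch of that WER loop for the English side, assuming the openai-whisper and jiwer packages (install separately if not pulled in by requirements.txt) and a placeholder audio path; for Mandarin, Paraformer-zh would be swapped in (e.g., via FunASR):

```python
import jiwer    # pip install jiwer
import whisper  # pip install openai-whisper (large-v3 needs a recent release)

model = whisper.load_model("large-v3")  # English ASR, as noted above

def wer_for(wav_path: str, reference: str) -> float:
    """Transcribe one synthesized utterance and score it against the text
    that drove the synthesis."""
    hypothesis = model.transcribe(wav_path, language="en")["text"]
    return jiwer.wer(reference.lower(), hypothesis.lower())

# Placeholder file name; substitute a real synthesized sample.
print(wer_for("synth_sample.wav", "the quick brown fox jumps over the lazy dog"))
```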

Maintenance & Community

The project is developed by Alibaba International Digital Commerce. Community suggestions for improvement are welcome.

Licensing & Compatibility

The project is licensed under the Apache License 2.0, permitting commercial use and linking with closed-source projects.

Limitations & Caveats

The reference training configuration calls for 8x NVIDIA A100 (80GB) GPUs, a substantial hardware requirement. Evaluation metrics are provided only for English and Mandarin; performance on other languages may vary depending on ASR model compatibility.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 4
  • Star History: 100 stars in the last 30 days

Explore Similar Projects

Starred by Omar Sanseviero (DevRel at Google DeepMind), Li Jiang (coauthor of AutoGen; engineer at Microsoft), and 2 more.

ChatTTS by 2noise

Top 0.2% on SourcePulse
38k stars

Generative speech model for daily dialogue

Created 1 year ago
Updated 2 months ago