Marco-Voice by AIDC-AI

Unified framework for expressive speech synthesis

Created 1 month ago
370 stars

Top 76.3% on SourcePulse

View on GitHub
Project Summary

Marco-Voice is a unified framework for expressive speech synthesis, offering voice cloning, emotion control, and cross-lingual capabilities. It aims to generate highly expressive, controllable, and natural speech that preserves speaker identity across diverse linguistic and emotional contexts. The target audience includes researchers and developers in speech synthesis and human-computer interaction, with the primary benefit being advanced control over synthesized speech characteristics.

How It Works

Marco-Voice employs a speaker-emotion disentanglement mechanism, utilizing in-batch contrastive learning to separate speaker identity from emotional style. A rotational emotion embedding integration method allows for smooth emotion control. A cross-attention mechanism further integrates emotional information with linguistic content during generation. This approach enables independent manipulation of speaker identity and emotional expression, leading to more nuanced and controllable speech synthesis.
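
A minimal PyTorch sketch may make the first two mechanisms concrete. This is not the repository's implementation: the function names, tensor shapes, and the RoPE-style pairwise rotation are illustrative assumptions based on the description above.

```python
import torch
import torch.nn.functional as F

def inbatch_contrastive_loss(spk_emb, speaker_ids, temperature=0.1):
    """In-batch contrastive objective for disentanglement: utterances of the
    same speaker (under any emotion) are positives, so the speaker embedding
    is pushed to be emotion-invariant; other speakers act as negatives."""
    z = F.normalize(spk_emb, dim=-1)                    # (B, D)
    sim = z @ z.T / temperature                         # (B, B) similarities
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))     # drop self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos = (speaker_ids[:, None] == speaker_ids[None, :]) & ~self_mask
    n_pos = pos.sum(1)
    keep = n_pos > 0                                    # anchors with a positive
    loss = -log_prob.masked_fill(~pos, 0.0).sum(1)[keep] / n_pos[keep]
    return loss.mean()

def rotate_emotion(spk_emb, angles):
    """Rotational emotion integration: rotate (even, odd) dimension pairs of
    the embedding by emotion-dependent angles; scaling the angles gives a
    smooth handle on emotion intensity."""
    x1, x2 = spk_emb[..., 0::2], spk_emb[..., 1::2]     # (B, D/2) each
    cos, sin = angles.cos(), angles.sin()               # angles: (B, D/2)
    out = torch.empty_like(spk_emb)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```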

Quick Start & Requirements

  • Installation: Requires Conda environment setup (conda create -n marco python=3.8, conda activate marco), cloning the repository, and installing requirements (pip install -r requirements.txt).
  • Prerequisites: Python 3.8, Conda. Training requires 8x NVIDIA A100 (80GB) GPUs; a quick environment check is sketched after this list.
  • Data Preparation: Scripts are provided for preparing custom datasets and integrating open-source data.
  • Links: GitHub, Paper, Hugging Face Datasets.
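
The following snippet (a sketch, not part of the repository; it assumes PyTorch is installed via requirements.txt) checks that the local setup matches the prerequisites above:

```python
import sys
import torch

# Warn rather than fail: the repository targets Python 3.8.
if sys.version_info[:2] != (3, 8):
    print(f"Warning: Python {sys.version_info.major}.{sys.version_info.minor} "
          "detected; the repo targets 3.8.")

print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPUs visible:   {torch.cuda.device_count()} "
      "(the reference training config uses 8x A100 80GB)")
for i in range(torch.cuda.device_count()):
    p = torch.cuda.get_device_properties(i)
    print(f"  [{i}] {p.name}, {p.total_memory / 2**30:.0f} GiB")
```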

Highlighted Details

  • Achieves superior speech clarity, naturalness, and speaker similarity compared with baseline systems such as CosyVoice1 and CosyVoice2, as validated by both human evaluations and objective metrics.
  • Introduces the CSEMOTIONS dataset, featuring 10.2 hours of Mandarin speech across seven emotional categories from professional voice actors, along with evaluation prompts in English and Chinese.
  • Supports cross-lingual emotion transfer, enabling the application of emotional styles across different languages.
  • Integrates ASR models (Whisper-large-v3 for English, Paraformer-zh for Mandarin) to compute Word Error Rate (WER) during evaluation; see the sketch after this list.
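
A minimal sketch of that WER loop for the English side, assuming the openai-whisper and jiwer packages (install separately if not pulled in by requirements.txt) and a placeholder audio path; for Mandarin, Paraformer-zh would be swapped in (e.g., via FunASR):

```python
import jiwer    # pip install jiwer
import whisper  # pip install openai-whisper (large-v3 needs a recent release)

model = whisper.load_model("large-v3")  # English ASR, as noted above

def wer_for(wav_path: str, reference: str) -> float:
    """Transcribe one synthesized utterance and score it against the text
    that drove the synthesis."""
    hypothesis = model.transcribe(wav_path, language="en")["text"]
    return jiwer.wer(reference.lower(), hypothesis.lower())

# Placeholder file name; substitute a real synthesized sample.
print(wer_for("synth_sample.wav", "the quick brown fox jumps over the lazy dog"))
```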

Maintenance & Community

The project is developed by Alibaba International Digital Commerce. Community suggestions for improvement are welcome.

Licensing & Compatibility

The project is licensed under the Apache License 2.0, permitting commercial use and linking with closed-source projects.

Limitations & Caveats

The reference training configuration calls for 8x NVIDIA A100 (80GB) GPUs, a substantial hardware requirement. Evaluation metrics are provided only for English and Mandarin; performance on other languages may vary depending on ASR model compatibility.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 4
  • Star History: 100 stars in the last 30 days

Explore Similar Projects

Starred by Omar Sanseviero (DevRel at Google DeepMind), Li Jiang (coauthor of AutoGen; engineer at Microsoft), and 2 more.

ChatTTS by 2noise

Top 0.2% on SourcePulse
38k stars

Generative speech model for daily dialogue

Created 1 year ago
Updated 2 months ago