Unified framework for expressive speech synthesis
Marco-Voice is a unified framework for expressive speech synthesis that offers voice cloning, emotion control, and cross-lingual capabilities. It aims to generate highly expressive, controllable, and natural speech that preserves speaker identity across diverse linguistic and emotional contexts. It targets researchers and developers working on speech synthesis and human-computer interaction, and its main benefit is fine-grained control over the characteristics of synthesized speech.
How It Works
Marco-Voice employs a speaker-emotion disentanglement mechanism, utilizing in-batch contrastive learning to separate speaker identity from emotional style. A rotational emotion embedding integration method allows for smooth emotion control. A cross-attention mechanism further integrates emotional information with linguistic content during generation. This approach enables independent manipulation of speaker identity and emotional expression, leading to more nuanced and controllable speech synthesis.
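The paragraph above names two techniques without giving their form. The block below is a minimal sketch under stated assumptions, not Marco-Voice's actual code. `in_batch_contrastive_loss` is a generic InfoNCE-style in-batch contrastive loss; in a disentanglement setup one could, hypothetically, pair emotion embeddings of utterances that share an emotion but come from different speakers, so the loss rewards emotion similarity regardless of speaker. `rotate_toward` interprets "rotational emotion embedding" as spherical interpolation (slerp) from a neutral direction toward an emotion direction, with `alpha` acting as an intensity knob. That interpretation, the function names, and the `temperature=0.1` default are all assumptions.

```python
import math

# Hypothetical helpers illustrating the techniques; not taken from Marco-Voice.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(a):
    n = math.sqrt(dot(a, a))
    return [x / n for x in a]

def in_batch_contrastive_loss(anchors, positives, temperature=0.1):
    """InfoNCE-style loss averaged over the batch.

    anchors[i] and positives[i] form a positive pair; every other
    positives[j] in the same batch serves as a negative for anchors[i].
    """
    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarities divided by the temperature act as logits.
        logits = [dot(ai, pj) / temperature for pj in p]
        # Numerically stable log-sum-exp over the whole batch.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
        # Cross-entropy with the matching index i as the target class.
        total += log_sum_exp - logits[i]
    return total / len(a)

def rotate_toward(neutral, emotion, alpha):
    """Slerp: rotate the unit direction of `neutral` toward `emotion`.

    alpha=0 returns the neutral direction, alpha=1 the emotion direction;
    values in between move along the great circle joining them.
    """
    u, v = normalize(neutral), normalize(emotion)
    theta = math.acos(max(-1.0, min(1.0, dot(u, v))))
    if theta < 1e-8:
        return u  # directions already coincide
    s = math.sin(theta)
    if s < 1e-8:
        raise ValueError("opposite directions: rotation plane is undefined")
    w_u = math.sin((1.0 - alpha) * theta) / s
    w_v = math.sin(alpha * theta) / s
    return [w_u * x + w_v * y for x, y in zip(u, v)]
```

Because slerp keeps the result on the unit sphere, intermediate intensities preserve the embedding's norm, which is one plausible reason a rotation-based scheme would give smoother control than linear blending.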
Quick Start & Requirements
Installation involves creating and activating a Conda environment (conda create -n marco python=3.8, conda activate marco), cloning the repository, and installing requirements (pip install -r requirements.txt).
Highlighted Details
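A quick pre-flight check can catch environment mistakes before the requirements are installed. The helper below is hypothetical (it is not part of the repository). It assumes the environment name `marco` and Python 3.8 from the setup commands, and it relies on the `CONDA_DEFAULT_ENV` variable that `conda activate` sets.

```python
import os
import sys

# Hypothetical pre-flight check; values mirror the setup commands above.
EXPECTED_ENV = "marco"       # from: conda create -n marco
EXPECTED_PYTHON = (3, 8)     # from: python=3.8

def preflight(env=None, version=None, repo_dir="."):
    """Return a list of problems; an empty list means every check passed.

    Arguments default to the live environment but can be overridden.
    """
    env = os.environ.get("CONDA_DEFAULT_ENV") if env is None else env
    version = tuple(sys.version_info[:2]) if version is None else tuple(version[:2])
    problems = []
    if env != EXPECTED_ENV:
        problems.append(f"active conda env is {env!r}, expected {EXPECTED_ENV!r}")
    if version != EXPECTED_PYTHON:
        problems.append(f"Python {version[0]}.{version[1]} found, expected 3.8")
    if not os.path.isfile(os.path.join(repo_dir, "requirements.txt")):
        problems.append("requirements.txt not found; run from the cloned repository root")
    return problems
```

Calling `preflight()` from the repository root with the `marco` environment active should return an empty list; each string in a non-empty result names one mismatch to fix before running `pip install -r requirements.txt`.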
Maintenance & Community
The project is developed by Alibaba International Digital Commerce. Community suggestions are welcomed for continuous improvement.
Licensing & Compatibility
The project is licensed under the Apache License 2.0, permitting commercial use and linking with closed-source projects.
Limitations & Caveats
The primary training configuration specifies 8x NVIDIA A100 GPUs, indicating a high hardware requirement for training. While evaluation metrics are provided for English and Mandarin, performance on other languages may vary depending on ASR model compatibility.
1 month ago
Inactive