CosyVoice by FunAudioLLM

Voice generation model for inference, training, and deployment

created 1 year ago
15,483 stars

Top 3.2% on sourcepulse

Project Summary

CosyVoice is a comprehensive, multilingual large voice generation model offering full-stack capabilities for inference, training, and deployment. It targets researchers and developers needing advanced text-to-speech (TTS) and voice conversion (VC) functionalities, enabling high-quality, low-latency, and expressive speech synthesis across multiple languages and dialects.

How It Works

CosyVoice 2.0 combines offline and streaming modeling in a single framework, supporting bidirectional streaming synthesis with first-packet latency as low as 150 ms. Compared with version 1.0, it improves pronunciation accuracy (lower character error rates), prosody, and sound quality, as reflected in higher MOS scores. It also supports cross-lingual and mixed-lingual zero-shot voice cloning, replicating a voice across languages and in code-switching scenarios.
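
As a minimal sketch of how the streaming path is consumed, based on the zero-shot usage shown in the project README (the CosyVoice2 class, load_wav helper, and stream flag come from those examples; the file paths and texts here are placeholders):

```python
import torchaudio
from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav

# Load a pretrained CosyVoice 2.0 checkpoint (see Quick Start for download).
cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B',
                       load_jit=False, load_trt=False, fp16=False)

# A short 16 kHz reference clip of the target speaker (placeholder path).
prompt_speech_16k = load_wav('./asset/zero_shot_prompt.wav', 16000)

# With stream=True the generator yields audio chunks as they are synthesized,
# instead of one complete waveform, which is what enables the low
# first-packet latency described above.
for i, chunk in enumerate(cosyvoice.inference_zero_shot(
        'Text to synthesize in the cloned voice.',   # placeholder target text
        'Transcript of the reference clip.',         # placeholder prompt text
        prompt_speech_16k, stream=True)):
    torchaudio.save(f'chunk_{i}.wav', chunk['tts_speech'], cosyvoice.sample_rate)
```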

Quick Start & Requirements

  • Install: Clone the repository with submodules (git clone --recursive) and install dependencies via Conda (conda create -n cosyvoice python=3.10, conda install -c conda-forge pynini, pip install -r requirements.txt). Ensure sox and its development headers (libsox-dev on Ubuntu, sox-devel on CentOS) are installed.
  • Models: Download pretrained models (e.g., iic/CosyVoice2-0.5B) using modelscope.snapshot_download or git clone; see the sketch after this list.
  • Usage: Examples are provided for the CosyVoice2 and CosyVoice classes, demonstrating zero-shot, cross-lingual, fine-grained control, and instruct-based synthesis.
  • Demo: A web demo is available at funaudiollm.github.io/cosyvoice2.
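
A sketch of the download-and-run flow, assuming the snapshot_download call and CosyVoice2 usage documented in the README (paths, texts, and the reference clip are placeholders):

```python
import torchaudio
from modelscope import snapshot_download
from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav

# Fetch the pretrained 0.5B model from ModelScope into a local directory.
snapshot_download('iic/CosyVoice2-0.5B',
                  local_dir='pretrained_models/CosyVoice2-0.5B')

cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B',
                       load_jit=False, load_trt=False, fp16=False)

# Zero-shot cloning: a 16 kHz reference clip plus its transcript.
prompt_speech_16k = load_wav('./asset/zero_shot_prompt.wav', 16000)
for i, out in enumerate(cosyvoice.inference_zero_shot(
        'Text to speak in the cloned voice.',   # placeholder target text
        'Transcript of the reference clip.',    # placeholder prompt text
        prompt_speech_16k, stream=False)):
    torchaudio.save(f'zero_shot_{i}.wav', out['tts_speech'], cosyvoice.sample_rate)
```

The README shows analogous calls for cross-lingual, fine-grained control, and instruct-based synthesis.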

Highlighted Details

  • CosyVoice 2.0 offers improved accuracy, stability, and speed over version 1.0.
  • Supports Chinese, English, Japanese, Korean, and various Chinese dialects.
  • Features zero-shot voice cloning for cross-lingual and code-switching scenarios.
  • Achieves ultra-low latency bidirectional streaming with rapid first-packet synthesis (150ms).
  • Claims a 30-50% reduction in pronunciation errors compared to version 1.0.

Maintenance & Community

The project acknowledges borrowing code from FunASR, FunCodec, Matcha-TTS, AcademiCodec, and WeNet. Discussion and communication are primarily through GitHub Issues.

Licensing & Compatibility

The repository does not explicitly state a license in the README; verify licensing before commercial use or closed-source linking.

Limitations & Caveats

The README does not specify licensing details, a critical factor for adoption. Some examples are sourced from the internet, and the project includes a disclaimer covering potential content infringement.

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 7
  • Issues (30d): 65
  • Star history: 2,070 stars gained in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering, Designing Machine Learning Systems), Georgios Konstantopoulos (CTO, General Partner at Paradigm), and 1 more.

OpenVoice by myshell-ai

Audio foundation model for versatile, instant voice cloning. Top 0.9%, 34k stars; created 1 year ago, updated 3 months ago.