CosyVoice by FunAudioLLM

Voice generation model for inference, training, and deployment

created 1 year ago
15,483 stars

Top 3.2% on sourcepulse

Project Summary

CosyVoice is a comprehensive, multilingual large voice generation model offering full-stack capabilities for inference, training, and deployment. It targets researchers and developers needing advanced text-to-speech (TTS) and voice conversion (VC) functionalities, enabling high-quality, low-latency, and expressive speech synthesis across multiple languages and dialects.

How It Works

CosyVoice 2.0 combines offline and streaming modeling in a single framework, supporting bidirectional streaming synthesis with first-packet latency as low as 150 ms. Compared with version 1.0, it improves pronunciation accuracy (lower character error rates), prosody, and sound quality, as reflected in higher MOS scores. It also supports cross-lingual and mixed-lingual zero-shot voice cloning, replicating a voice across languages and in code-switching scenarios.
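
As a minimal sketch of how the streaming path is consumed, based on the zero-shot usage shown in the project README (the CosyVoice2 class, load_wav helper, and stream flag come from those examples; the file paths and texts here are placeholders):

```python
import torchaudio
from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav

# Load a pretrained CosyVoice 2.0 checkpoint (see Quick Start for download).
cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B',
                       load_jit=False, load_trt=False, fp16=False)

# A short 16 kHz reference clip of the target speaker (placeholder path).
prompt_speech_16k = load_wav('./asset/zero_shot_prompt.wav', 16000)

# With stream=True the generator yields audio chunks as they are synthesized,
# instead of one complete waveform, which is what enables the low
# first-packet latency described above.
for i, chunk in enumerate(cosyvoice.inference_zero_shot(
        'Text to synthesize in the cloned voice.',   # placeholder target text
        'Transcript of the reference clip.',         # placeholder prompt text
        prompt_speech_16k, stream=True)):
    torchaudio.save(f'chunk_{i}.wav', chunk['tts_speech'], cosyvoice.sample_rate)
```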

Quick Start & Requirements

  • Install: Clone the repository with submodules (git clone --recursive) and install dependencies via Conda (conda create -n cosyvoice python=3.10, conda install -c conda-forge pynini, pip install -r requirements.txt). Ensure sox and its development headers (libsox-dev on Ubuntu, sox-devel on CentOS) are installed.
  • Models: Download pretrained models (e.g., iic/CosyVoice2-0.5B) using modelscope.snapshot_download or git clone; see the sketch after this list.
  • Usage: Examples are provided for the CosyVoice2 and CosyVoice classes, demonstrating zero-shot, cross-lingual, fine-grained control, and instruct-based synthesis.
  • Demo: A web demo is available at funaudiollm.github.io/cosyvoice2.
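
A sketch of the download-and-run flow, assuming the snapshot_download call and CosyVoice2 usage documented in the README (paths, texts, and the reference clip are placeholders):

```python
import torchaudio
from modelscope import snapshot_download
from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav

# Fetch the pretrained 0.5B model from ModelScope into a local directory.
snapshot_download('iic/CosyVoice2-0.5B',
                  local_dir='pretrained_models/CosyVoice2-0.5B')

cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B',
                       load_jit=False, load_trt=False, fp16=False)

# Zero-shot cloning: a 16 kHz reference clip plus its transcript.
prompt_speech_16k = load_wav('./asset/zero_shot_prompt.wav', 16000)
for i, out in enumerate(cosyvoice.inference_zero_shot(
        'Text to speak in the cloned voice.',   # placeholder target text
        'Transcript of the reference clip.',    # placeholder prompt text
        prompt_speech_16k, stream=False)):
    torchaudio.save(f'zero_shot_{i}.wav', out['tts_speech'], cosyvoice.sample_rate)
```

The README shows analogous calls for cross-lingual, fine-grained control, and instruct-based synthesis.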

Highlighted Details

  • CosyVoice 2.0 offers improved accuracy, stability, and speed over version 1.0.
  • Supports Chinese, English, Japanese, Korean, and various Chinese dialects.
  • Features zero-shot voice cloning for cross-lingual and code-switching scenarios.
  • Achieves ultra-low latency bidirectional streaming with rapid first-packet synthesis (150ms).
  • Claims a 30-50% reduction in pronunciation errors compared to version 1.0.

Maintenance & Community

The project acknowledges borrowing code from FunASR, FunCodec, Matcha-TTS, AcademiCodec, and WeNet. Discussion and communication are primarily through GitHub Issues.

Licensing & Compatibility

The repository does not explicitly state a license in the README; verify licensing before commercial use or closed-source linking.

Limitations & Caveats

The README does not specify licensing details, a critical factor for adoption. Some examples are sourced from the internet, and the project includes a disclaimer covering potential content infringement.

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 7
  • Issues (30d): 65
  • Star history: 2,070 stars gained in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering, Designing Machine Learning Systems), Georgios Konstantopoulos (CTO, General Partner at Paradigm), and 1 more.

OpenVoice by myshell-ai

Audio foundation model for versatile, instant voice cloning. Top 0.9%, 34k stars; created 1 year ago, updated 3 months ago.