CosyVoice by FunAudioLLM

Voice generation model for inference, training, and deployment

Created 1 year ago
16,439 stars

Top 2.9% on SourcePulse

Project Summary

CosyVoice is a comprehensive, multilingual large voice generation model offering full-stack capabilities for inference, training, and deployment. It targets researchers and developers needing advanced text-to-speech (TTS) and voice conversion (VC) functionalities, enabling high-quality, low-latency, and expressive speech synthesis across multiple languages and dialects.

How It Works

CosyVoice 2.0 leverages a combination of offline and streaming modeling technologies for ultra-low latency bidirectional streaming. It achieves rapid first-packet synthesis with latencies as low as 150ms. The model boasts improved pronunciation accuracy, reduced character error rates, and enhanced prosody and sound quality, evidenced by higher MOS scores. It supports cross-lingual and mix-lingual zero-shot voice cloning, allowing for seamless voice replication across different languages and code-switching scenarios.
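The streaming mode described above is exposed through a generator-style API. The sketch below follows the shape of the upstream README's zero-shot example with `stream=True`; the model path, prompt file, and texts are illustrative placeholders, not guaranteed defaults.

```python
# Sketch of streaming zero-shot synthesis with CosyVoice 2.0.
# Paths, texts, and the prompt wav are illustrative.
import sys
sys.path.append('third_party/Matcha-TTS')  # bundled submodule per the install steps

import torchaudio
from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav

cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B')
prompt_speech_16k = load_wav('./asset/zero_shot_prompt.wav', 16000)

# With stream=True the generator yields audio chunks as they are produced,
# which is what enables the low first-packet latency described above.
for i, chunk in enumerate(cosyvoice.inference_zero_shot(
        'Text to synthesize.', 'Transcript of the prompt audio.',
        prompt_speech_16k, stream=True)):
    torchaudio.save(f'streaming_chunk_{i}.wav',
                    chunk['tts_speech'], cosyvoice.sample_rate)
```

Each yielded chunk can be forwarded to a playback buffer instead of a file, which is how the bidirectional streaming scenario would consume it.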

Quick Start & Requirements

  • Install: Clone the repository with submodules (git clone --recursive) and install dependencies via Conda (conda create -n cosyvoice python=3.10, conda install -c conda-forge pynini, pip install -r requirements.txt). Ensure sox is installed along with its development headers (libsox-dev on Ubuntu, sox-devel on CentOS).
  • Models: Download pretrained models (e.g., iic/CosyVoice2-0.5B) using modelscope.snapshot_download or git clone.
  • Usage: Examples provided for CosyVoice2 and CosyVoice classes, demonstrating zero-shot, cross-lingual, fine-grained control, and instruct-based synthesis.
  • Demo: A web demo is available at funaudiollm.github.io/cosyvoice2.
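Putting the install and model-download steps above together, a minimal end-to-end sketch looks like the following. It mirrors the upstream README's examples; the prompt wav and output filenames are illustrative.

```python
# Download the pretrained model, then run offline (non-streaming)
# zero-shot synthesis. Paths and texts are illustrative.
import sys
sys.path.append('third_party/Matcha-TTS')  # bundled submodule per the install steps

import torchaudio
from modelscope import snapshot_download
from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav

# Fetch iic/CosyVoice2-0.5B into a local directory (one-time download).
snapshot_download('iic/CosyVoice2-0.5B',
                  local_dir='pretrained_models/CosyVoice2-0.5B')

cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B')
prompt_speech_16k = load_wav('./asset/zero_shot_prompt.wav', 16000)

# Zero-shot cloning: the prompt audio plus its transcript condition the voice.
for i, out in enumerate(cosyvoice.inference_zero_shot(
        'Text to synthesize.', 'Transcript of the prompt audio.',
        prompt_speech_16k, stream=False)):
    torchaudio.save(f'zero_shot_{i}.wav',
                    out['tts_speech'], cosyvoice.sample_rate)
```

The same `inference_*` family covers the cross-lingual and instruct-based modes mentioned above, with analogous call signatures.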

Highlighted Details

  • CosyVoice 2.0 offers improved accuracy, stability, and speed over version 1.0.
  • Supports Chinese, English, Japanese, Korean, and various Chinese dialects.
  • Features zero-shot voice cloning for cross-lingual and code-switching scenarios.
  • Achieves ultra-low latency bidirectional streaming with rapid first-packet synthesis (150ms).
  • Claims a 30-50% reduction in pronunciation errors compared to version 1.0.

Maintenance & Community

The project acknowledges borrowing code from FunASR, FunCodec, Matcha-TTS, AcademiCodec, and WeNet. Discussion and communication are primarily through GitHub Issues.

Licensing & Compatibility

The repository does not explicitly state a license in the README. This requires further investigation for commercial use or closed-source linking.

Limitations & Caveats

The README does not specify licensing details, which is a critical factor for adoption. Some examples are sourced from the internet, with a disclaimer for potential content infringement.

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 7
  • Issues (30d): 52
  • Star History: 647 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Pietro Schirano (founder of MagicPath), and 2 more.

metavoice-src by metavoiceio

Top 0.1% on SourcePulse · 4k stars
TTS model for human-like, expressive speech
Created 1 year ago · Updated 1 year ago
Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Junyang Lin (core maintainer at Alibaba Qwen), and 6 more.

OpenVoice by myshell-ai

Top 0.2% on SourcePulse · 34k stars
Audio foundation model for versatile, instant voice cloning
Created 1 year ago · Updated 5 months ago
Starred by Georgios Konstantopoulos (CTO and General Partner at Paradigm) and Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems").

GPT-SoVITS by RVC-Boss

Top 0.3% on SourcePulse · 51k stars
Few-shot voice cloning and TTS web UI
Created 1 year ago · Updated 1 week ago