CosyVoice_For_Windows  by v3ucn

Windows version of a voice model

Created 1 year ago
731 stars

Top 47.3% on SourcePulse

Project Summary

This repository provides a Windows-specific build of CosyVoice, an advanced text-to-speech (TTS) model. It enables users to perform zero-shot, cross-lingual, and instruction-based voice synthesis with high fidelity, targeting researchers and developers working with multilingual speech generation on Windows.

How It Works

CosyVoice uses a multi-stage pipeline. The README does not spell out the stages, but they likely include acoustic modeling, vocoding, and a speaker or style embedding. The project targets accelerated inference on Windows and requires specific versions of Python, CUDA, and cuDNN. It supports three inference modes: zero-shot (cloning a voice from a short audio sample), cross-lingual (synthesizing speech in one language from a prompt recorded in another), and instruction-based (generating speech from text plus a description of the speaker).
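To make the difference between the three modes concrete, here is a minimal sketch of the inputs each one needs, based only on the descriptions above. The class, field names, and mode strings are hypothetical and are not the repository's API; the real interface may require additional inputs.

```python
from dataclasses import dataclass
from typing import List, Optional

# Inputs per mode, read off the prose description above.
# All names here are illustrative assumptions, not the project's API.
REQUIRED_INPUTS = {
    "zero_shot": ("text", "prompt_audio"),      # clone a voice from a short sample
    "cross_lingual": ("text", "prompt_audio"),  # sample is in a different language
    "instruct": ("text", "instruction"),        # speaker described in words
}


@dataclass
class SynthesisRequest:
    mode: str
    text: str
    prompt_audio: Optional[str] = None  # path to a short reference recording
    instruction: Optional[str] = None   # natural-language speaker description


def missing_inputs(req: SynthesisRequest) -> List[str]:
    """Return the names of required inputs that are empty for req.mode."""
    if req.mode not in REQUIRED_INPUTS:
        raise ValueError(f"unknown mode: {req.mode!r}")
    return [name for name in REQUIRED_INPUTS[req.mode] if not getattr(req, name)]
```

In this sketch, zero-shot and cross-lingual take the same inputs; what separates them is whether the reference recording and the target text share a language.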

Quick Start & Requirements

  • Installation: Clone the repository, create a conda environment with Python 3.11, install dependencies via pip install -r requirements.txt, and install PyTorch with CUDA support (pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121). A specific DeepSpeed build for Windows is also required.
  • Prerequisites: Python 3.11, CUDA 12.1+, cuDNN 9.4+, Git LFS.
  • Models: Pre-trained models (CosyVoice-300M, -SFT, -Instruct, speech_kantts_ttsfrd) must be downloaded.
  • Demo: A web UI can be launched with python3 webui.py.
  • Docs: CosyVoice Paper, CosyVoice Demos, CosyVoice Studio, CosyVoice Code.
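Because the setup depends on exact versions, a small preflight script can catch a mismatch before launching the web UI. The sketch below checks the prerequisites listed above. The `pretrained_models` directory name and the expanded model folder names are assumptions, and the helper functions are hypothetical; adjust them to match the actual checkout.

```python
import subprocess
import sys
from pathlib import Path

REQUIRED_PYTHON = (3, 11)
# Folder names expanded from the model list above; treat them as assumptions.
MODEL_NAMES = [
    "CosyVoice-300M",
    "CosyVoice-300M-SFT",
    "CosyVoice-300M-Instruct",
    "speech_kantts_ttsfrd",
]


def python_ok(version_info=sys.version_info, required=REQUIRED_PYTHON):
    """True when the interpreter's major.minor version matches the requirement."""
    return tuple(version_info[:2]) == required


def missing_models(model_root, names=MODEL_NAMES):
    """Return the model folders that are not present under model_root."""
    root = Path(model_root)
    return [name for name in names if not (root / name).is_dir()]


def git_lfs_ok():
    """True when `git lfs version` runs successfully."""
    try:
        result = subprocess.run(["git", "lfs", "version"], capture_output=True)
    except FileNotFoundError:  # git itself is not on PATH
        return False
    return result.returncode == 0


def preflight(model_root="pretrained_models"):
    """Collect human-readable problems; an empty list means all checks passed."""
    problems = []
    if not python_ok():
        problems.append("Python 3.11 is required")
    if not git_lfs_ok():
        problems.append("Git LFS was not found")
    try:
        import torch  # installed separately from the cu121 wheel index
        if not torch.cuda.is_available():
            problems.append("PyTorch cannot see a CUDA device")
    except ImportError:
        problems.append("PyTorch is not installed")
    for name in missing_models(model_root):
        problems.append(f"model folder missing: {name}")
    return problems


if __name__ == "__main__":
    for line in preflight() or ["all checks passed"]:
        print(line)
```

Running the script from the repository root prints one line per failed check, or `all checks passed` when nothing is missing. It does not verify the cuDNN version or the Windows DeepSpeed build.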

Highlighted Details

  • Supports zero-shot, cross-lingual, and instruction-based TTS.
  • Optimized for Windows environments with specific dependency requirements.
  • Offers a web UI for quick experimentation.
  • Provides a Docker image for deployment.

Maintenance & Community

The project acknowledges borrowing code from several other open-source projects (FunASR, FunCodec, Matcha-TTS, AcademiCodec, WeNet). Discussion is primarily through GitHub Issues.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. However, the underlying CosyVoice project is typically associated with research and academic use, and commercial use would require careful review of the original project's licensing.

Limitations & Caveats

The setup is highly specific to Windows and requires precise versions of CUDA and other dependencies, which may be challenging to manage. The project is presented as a "version for Windows environment," implying it might not be the latest official release and could lag behind or introduce platform-specific issues.

Health Check

  • Last commit: 10 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 10 stars in the last 30 days
