async_cosyvoice by qi-hua

Async acceleration for CosyVoice2 inference

Created 6 months ago
417 stars

Top 70.4% on SourcePulse

Project Summary

Async CosyVoice accelerates CosyVoice2 inference on Linux by using vLLM for the LLM stage and multiple estimator instances for the Flow component. This cuts the real-time factor (RTF) from roughly 0.25-0.30 to 0.1-0.15 and improves concurrency, making it suitable for researchers and developers who need faster text-to-speech generation.

How It Works

The project integrates vLLM to speed up the Large Language Model (LLM) stage of CosyVoice2 inference. The Flow component uses the official load_jit or load_trt modes, further accelerated by running multiple estimator instances (a contribution from hexisyztem). This hybrid approach aims to maximize throughput and minimize latency by optimizing both the language-generation and acoustic-modeling stages.
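
For orientation, a minimal initialization sketch is shown below. The class name AsyncCosyVoice2 and the load_trt / estimator_count arguments are assumptions inferred from the description above, not the project's confirmed API.

```python
# Hypothetical initialization sketch -- names are assumptions, not confirmed API.
from async_cosyvoice import AsyncCosyVoice2  # assumed import path

cosyvoice = AsyncCosyVoice2(
    "pretrained_models/CosyVoice2-0.5B",  # downloaded model directory
    load_trt=True,        # official TensorRT mode for the Flow component
    estimator_count=4,    # multiple flow estimator instances (hexisyztem's approach)
)
# The LLM stage is served through vLLM internally, so no separate vLLM setup is shown.
```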

Quick Start & Requirements

  • Install: Clone CosyVoice, then clone async_cosyvoice inside it. Install dependencies with pip install -r requirements.txt in the async_cosyvoice directory.
  • Prerequisites: Python 3.10.16, Conda, CUDA 12.4, vllm==0.7.3, torch==2.5.1, onnxruntime-gpu==1.19.0. Requires downloading the CosyVoice2-0.5B model files.
  • Setup: Requires model download and file copying; configuration lives in config.py (see the usage sketch after this list).
  • Docs: CosyVoice
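
A hypothetical end-to-end streaming call under the same assumptions as the earlier sketch (the AsyncCosyVoice2 class, an async inference_zero_shot generator modeled on the upstream CosyVoice2 API, and a local prompt.wav are all illustrative, not confirmed interfaces):

```python
# Hypothetical streaming usage -- method names and return keys are assumptions
# modeled on the upstream CosyVoice2 API.
import asyncio

import torch
import torchaudio
from async_cosyvoice import AsyncCosyVoice2  # assumed import path

async def main():
    cosyvoice = AsyncCosyVoice2("pretrained_models/CosyVoice2-0.5B", load_trt=True)
    prompt_speech, _sr = torchaudio.load("prompt.wav")  # short reference clip of the target voice

    chunks = []
    # stream=True yields audio chunks as they are generated (low first-packet latency)
    async for out in cosyvoice.inference_zero_shot(
        "Hello, this is a quick CosyVoice2 test.",  # text to synthesize
        "transcript of prompt.wav",                 # prompt transcript
        prompt_speech,
        stream=True,
    ):
        chunks.append(out["tts_speech"])

    torchaudio.save("output.wav", torch.cat(chunks, dim=1), cosyvoice.sample_rate)

asyncio.run(main())
```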

Highlighted Details

  • Achieves RTF of 0.1-0.15, halving original inference time.
  • Supports ~150-250ms first-packet latency for streaming.
  • Enables 20 non-streaming or 10 streaming concurrent tasks on a 4070 GPU.
  • Introduces spk2info.pt for faster inference by skipping prompt-audio processing (illustrated after this list).
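
A hypothetical illustration of the spk2info.pt idea from the last bullet: speaker features extracted once from a prompt clip are cached under an ID so later requests skip prompt-audio processing. The dictionary contents and the inference_sft call are assumptions.

```python
# Hypothetical spk2info.pt usage -- keys and method names are assumptions.
import torch

spk2info = torch.load("spk2info.pt", map_location="cpu")  # {spk_id: precomputed prompt features}
print(list(spk2info.keys()))                               # inspect which speakers are cached

# Later requests reference a cached speaker ID instead of re-processing prompt audio,
# e.g. something along the lines of:
#   async for out in cosyvoice.inference_sft("Text to speak", spk_id="my_speaker", stream=True):
#       ...
```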

Maintenance & Community

  • Project is a fork/extension of the original CosyVoice.
  • No specific community links (Discord/Slack) or roadmap mentioned in the README.

Licensing & Compatibility

  • The README does not explicitly state the license for async_cosyvoice. It depends on the licensing of the original CosyVoice project and vLLM.
  • Compatibility for commercial use is not specified.

Limitations & Caveats

This project is Linux-only and pins specific versions of CUDA, PyTorch, and vLLM. It also requires a warm-up of more than 10 inference runs before the TRT flow model reaches steady-state latency.
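
A minimal warm-up sketch under the same assumptions as the earlier examples (the README only states that more than 10 runs are needed; the API names are carried over from the hypothetical sketches above):

```python
# Hypothetical TRT warm-up loop -- run a dozen throwaway requests before measuring latency.
async def warm_up(cosyvoice, prompt_speech, rounds: int = 12):
    for _ in range(rounds):
        async for _ in cosyvoice.inference_zero_shot(
            "Warm-up sentence.", "prompt transcript", prompt_speech, stream=True
        ):
            pass  # discard audio; only the side effect (engine warm-up) matters
```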

Health Check

  • Last Commit: 4 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 2
  • Star History: 25 stars in the last 30 days

Explore Similar Projects

Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), Nikola Borisov (Founder and CEO of DeepInfra), and 3 more.

tensorrtllm_backend by triton-inference-server

0.2%
889 stars
Triton backend for serving TensorRT-LLM models
Created 2 years ago
Updated 1 day ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Luis Capelo (Cofounder of Lightning AI), and 3 more.

LitServe by Lightning-AI

0.3%
4k stars
AI inference pipeline framework
Created 1 year ago
Updated 2 days ago
Starred by Carol Willing (Core Contributor to CPython, Jupyter), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 9 more.

dynamo by ai-dynamo

1.0%
5k stars
Inference framework for distributed generative AI model serving
Created 6 months ago
Updated 22 hours ago