async_cosyvoice by qi-hua

Async acceleration for CosyVoice2 inference

Created 6 months ago
417 stars

Top 70.4% on SourcePulse

Project Summary

Async CosyVoice accelerates CosyVoice2 inference on Linux by using vLLM for the LLM stage and multiple estimator instances for the Flow component. This cuts the real-time factor (RTF) from roughly 0.25-0.30 to 0.1-0.15 and improves concurrency, making it suitable for researchers and developers who need faster text-to-speech generation.

How It Works

The project integrates vLLM to speed up the Large Language Model (LLM) stage of CosyVoice2 inference. The Flow component uses the official load_jit or load_trt modes, further accelerated by running multiple estimator instances (a contribution from hexisyztem). This hybrid approach aims to maximize throughput and minimize latency by optimizing both the language-generation and acoustic-modeling stages.
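
For orientation, a minimal initialization sketch is shown below. The class name AsyncCosyVoice2 and the load_trt / estimator_count arguments are assumptions inferred from the description above, not the project's confirmed API.

```python
# Hypothetical initialization sketch -- names are assumptions, not confirmed API.
from async_cosyvoice import AsyncCosyVoice2  # assumed import path

cosyvoice = AsyncCosyVoice2(
    "pretrained_models/CosyVoice2-0.5B",  # downloaded model directory
    load_trt=True,        # official TensorRT mode for the Flow component
    estimator_count=4,    # multiple flow estimator instances (hexisyztem's approach)
)
# The LLM stage is served through vLLM internally, so no separate vLLM setup is shown.
```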

Quick Start & Requirements

  • Install: Clone CosyVoice, then clone async_cosyvoice inside it. Install dependencies with pip install -r requirements.txt in the async_cosyvoice directory.
  • Prerequisites: Python 3.10.16, Conda, CUDA 12.4, vllm==0.7.3, torch==2.5.1, onnxruntime-gpu==1.19.0. Requires downloading the CosyVoice2-0.5B model files.
  • Setup: Requires model download and file copying; configuration lives in config.py (see the usage sketch after this list).
  • Docs: CosyVoice
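
A hypothetical end-to-end streaming call under the same assumptions as the earlier sketch (the AsyncCosyVoice2 class, an async inference_zero_shot generator modeled on the upstream CosyVoice2 API, and a local prompt.wav are all illustrative, not confirmed interfaces):

```python
# Hypothetical streaming usage -- method names and return keys are assumptions
# modeled on the upstream CosyVoice2 API.
import asyncio

import torch
import torchaudio
from async_cosyvoice import AsyncCosyVoice2  # assumed import path

async def main():
    cosyvoice = AsyncCosyVoice2("pretrained_models/CosyVoice2-0.5B", load_trt=True)
    prompt_speech, _sr = torchaudio.load("prompt.wav")  # short reference clip of the target voice

    chunks = []
    # stream=True yields audio chunks as they are generated (low first-packet latency)
    async for out in cosyvoice.inference_zero_shot(
        "Hello, this is a quick CosyVoice2 test.",  # text to synthesize
        "transcript of prompt.wav",                 # prompt transcript
        prompt_speech,
        stream=True,
    ):
        chunks.append(out["tts_speech"])

    torchaudio.save("output.wav", torch.cat(chunks, dim=1), cosyvoice.sample_rate)

asyncio.run(main())
```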

Highlighted Details

  • Achieves RTF of 0.1-0.15, halving original inference time.
  • Supports ~150-250ms first-packet latency for streaming.
  • Enables 20 non-streaming or 10 streaming concurrent tasks on a 4070 GPU.
  • Introduces spk2info.pt for faster inference by skipping prompt-audio processing (illustrated after this list).
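
A hypothetical illustration of the spk2info.pt idea from the last bullet: speaker features extracted once from a prompt clip are cached under an ID so later requests skip prompt-audio processing. The dictionary contents and the inference_sft call are assumptions.

```python
# Hypothetical spk2info.pt usage -- keys and method names are assumptions.
import torch

spk2info = torch.load("spk2info.pt", map_location="cpu")  # {spk_id: precomputed prompt features}
print(list(spk2info.keys()))                               # inspect which speakers are cached

# Later requests reference a cached speaker ID instead of re-processing prompt audio,
# e.g. something along the lines of:
#   async for out in cosyvoice.inference_sft("Text to speak", spk_id="my_speaker", stream=True):
#       ...
```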

Maintenance & Community

  • Project is a fork/extension of the original CosyVoice.
  • No specific community links (Discord/Slack) or roadmap mentioned in the README.

Licensing & Compatibility

  • The README does not explicitly state the license for async_cosyvoice. It depends on the licensing of the original CosyVoice project and vLLM.
  • Compatibility for commercial use is not specified.

Limitations & Caveats

This project is Linux-only and pins specific versions of CUDA, PyTorch, and vLLM. It also requires a warm-up of more than 10 inference runs before the TRT flow model reaches steady-state latency.
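
A minimal warm-up sketch under the same assumptions as the earlier examples (the README only states that more than 10 runs are needed; the API names are carried over from the hypothetical sketches above):

```python
# Hypothetical TRT warm-up loop -- run a dozen throwaway requests before measuring latency.
async def warm_up(cosyvoice, prompt_speech, rounds: int = 12):
    for _ in range(rounds):
        async for _ in cosyvoice.inference_zero_shot(
            "Warm-up sentence.", "prompt transcript", prompt_speech, stream=True
        ):
            pass  # discard audio; only the side effect (engine warm-up) matters
```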

Health Check

  • Last Commit: 4 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 2
  • Star History: 25 stars in the last 30 days

Explore Similar Projects

Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), Nikola Borisov (Founder and CEO of DeepInfra), and 3 more.

tensorrtllm_backend by triton-inference-server

0.2%
889 stars
Triton backend for serving TensorRT-LLM models
Created 2 years ago
Updated 1 day ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Luis Capelo (Cofounder of Lightning AI), and 3 more.

LitServe by Lightning-AI

0.3%
4k stars
AI inference pipeline framework
Created 1 year ago
Updated 2 days ago
Starred by Carol Willing (Core Contributor to CPython, Jupyter), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 9 more.

dynamo by ai-dynamo

1.0%
5k stars
Inference framework for distributed generative AI model serving
Created 6 months ago
Updated 22 hours ago