index-tts-vllm by Ksuriuri

Accelerated TTS inference with vLLM

Created 4 months ago
579 stars

Top 55.9% on SourcePulse

Project Summary

This project enhances IndexTTS, a text-to-speech system, by integrating vLLM for significantly faster inference. It targets researchers and developers needing high-throughput TTS capabilities, offering substantial speedups and improved concurrency for GPT model decoding.

How It Works

The project leverages vLLM's optimized inference engine to accelerate the GPT model component of IndexTTS. This approach utilizes techniques like paged attention and continuous batching to maximize GPU utilization and throughput, resulting in a ~3x reduction in Real-Time Factor (RTF) and a ~3x increase in GPT decoding speed compared to the original IndexTTS.
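To make the continuous-batching idea concrete, here is a toy scheduler (illustrative only, not vLLM's actual implementation): finished sequences free their slot immediately, so new requests join mid-flight instead of waiting for the whole batch to drain. This is what lets the server sustain high concurrency during GPT decoding.

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """Toy model of continuous batching. Each request is
    (request_id, tokens_to_generate); returns total decode steps."""
    waiting = deque(requests)
    active = {}  # request_id -> tokens remaining
    steps = 0
    while waiting or active:
        # Admit new requests into any free slots before each decode step.
        while waiting and len(active) < max_batch:
            rid, n = waiting.popleft()
            active[rid] = n
        # One decode step generates one token for every active sequence.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]  # slot freed immediately
        steps += 1
    return steps

# Five requests of mixed lengths with batch size 2: 5 decode steps,
# versus 6 if each static batch had to drain completely first.
print(continuous_batching([("a", 3), ("b", 1), ("c", 2), ("d", 2), ("e", 1)], max_batch=2))
```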

Quick Start & Requirements

  • Install: Clone the repository, create a conda environment (conda create -n index-tts-vllm python=3.12), activate it, and install dependencies (pip install -r requirements.txt).
  • Prerequisites: PyTorch (2.7.0 recommended for vLLM 0.9.0; fall back to 2.5.1 with vLLM 0.7.3 if the newer vLLM does not support your GPU) and IndexTTS-1.5 model weights from HuggingFace or ModelScope.
  • Setup: Download model weights, convert them using bash convert_hf_format.sh /path/to/your/model_dir, and then run VLLM_USE_V1=0 python webui.py or VLLM_USE_V1=0 python api_server.py --model_dir /your/path/to/Index-TTS. Initial startup may involve CUDA kernel compilation for BigVGAN.
  • Docs: Project Repository
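The steps above can be collected into one shell session (paths are placeholders from the README, not real locations):

```shell
# Create and activate the environment (Python 3.12 per the README).
conda create -n index-tts-vllm python=3.12
conda activate index-tts-vllm
pip install -r requirements.txt

# Convert the downloaded IndexTTS-1.5 weights to the expected format.
bash convert_hf_format.sh /path/to/your/model_dir

# Launch the web UI or the API server (VLLM_USE_V1=0 is required).
VLLM_USE_V1=0 python webui.py
# or
VLLM_USE_V1=0 python api_server.py --model_dir /your/path/to/Index-TTS
```

Expect a slower first launch while the CUDA kernels for BigVGAN compile.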

Highlighted Details

  • Achieves RTF of ~0.1 and GPT decoding speeds of ~280 tokens/s on a single RTX 4090.
  • Supports concurrent requests, with practical testing showing stable performance for ~16 concurrent users.
  • Introduces multi-character audio mixing by blending the voice timbres of multiple reference audio inputs.
  • Maintains comparable Word Error Rate (WER) to the original IndexTTS.
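The Real-Time Factor figure above is the ratio of synthesis time to the duration of the audio produced, so RTF < 1 means faster than real time. A minimal sketch of the arithmetic:

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time spent synthesizing / duration of audio produced.
    RTF < 1 means the system generates speech faster than real time."""
    return synthesis_seconds / audio_seconds

# At the reported RTF of ~0.1, 10 s of speech takes about 1 s to generate.
print(real_time_factor(1.0, 10.0))  # → 0.1
```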

Maintenance & Community

No specific community channels or notable contributors are mentioned in the README.

Licensing & Compatibility

The project's licensing is not explicitly stated in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The README notes that using multiple reference audio inputs for multi-character mixing can produce an unstable output voice timbre. The VLLM_USE_V1=0 flag is mandatory for running the provided scripts, indicating potential compatibility issues with vLLM's v1 engine.

Health Check

  • Last Commit: 5 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 18
  • Star History: 181 stars in the last 30 days
