index-tts-vllm  by Ksuriuri

Accelerated TTS inference with vLLM

created 3 months ago
352 stars

Top 80.3% on sourcepulse

Project Summary

This project enhances IndexTTS, a text-to-speech system, by integrating vLLM for significantly faster inference. It targets researchers and developers needing high-throughput TTS capabilities, offering substantial speedups and improved concurrency for GPT model decoding.

How It Works

The project leverages vLLM's optimized inference engine to accelerate the GPT model component of IndexTTS. This approach uses techniques like paged attention and continuous batching to maximize GPU utilization and throughput, yielding roughly 3x lower Real-Time Factor (RTF) and roughly 3x faster GPT decoding compared to the original IndexTTS.
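For context on the RTF figures above: RTF is the ratio of synthesis time to the duration of the audio produced, so values below 1 mean faster-than-real-time synthesis. A minimal illustrative sketch (the function name is ours, not part of the project):

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = wall-clock time spent synthesizing / duration of audio produced.

    RTF < 1 means the system generates audio faster than real time;
    a 3x speedup corresponds to dividing RTF by 3.
    """
    return synthesis_seconds / audio_seconds

# e.g. 1 s of compute producing 10 s of audio gives RTF 0.1
print(real_time_factor(1.0, 10.0))
```
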

Quick Start & Requirements

  • Install: Clone the repository, create a conda environment (conda create -n index-tts-vllm python=3.12), activate it, and install dependencies (pip install -r requirements.txt).
  • Prerequisites: PyTorch 2.7.0 (recommended, paired with vLLM 0.9.0); if your GPU is not supported, use PyTorch 2.5.1 with vLLM 0.7.3. IndexTTS-1.5 model weights compatible with HuggingFace or ModelScope.
  • Setup: Download model weights, convert them using bash convert_hf_format.sh /path/to/your/model_dir, and then run VLLM_USE_V1=0 python webui.py or VLLM_USE_V1=0 python api_server.py --model_dir /your/path/to/Index-TTS. Initial startup may involve CUDA kernel compilation for BigVGAN.
  • Docs: Project Repository

Highlighted Details

  • Achieves RTF of ~0.1 and GPT decoding speeds of ~280 tokens/s on a single RTX 4090.
  • Supports concurrent requests, with practical testing showing stable performance for ~16 concurrent users.
  • Introduces multi-character audio mixing by blending the voice timbres (声线) of multiple reference audio inputs.
  • Maintains comparable Word Error Rate (WER) to the original IndexTTS.
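The RTF and concurrency figures above are related: at a per-stream RTF of r, one stream occupies a fraction r of real time, giving a naive serial bound of 1/r real-time streams per GPU. A back-of-envelope sketch (our own helper, not project code):

```python
def naive_realtime_stream_bound(rtf: float) -> float:
    """Naive upper bound on concurrent real-time streams if requests
    were served strictly one at a time: each stream uses a fraction
    `rtf` of wall-clock time, so at most 1/rtf streams fit."""
    return 1.0 / rtf

print(naive_realtime_stream_bound(0.1))
```

With continuous batching, vLLM can exceed this serial estimate, which is consistent with the ~16 stable concurrent users reported despite a single-stream RTF of ~0.1.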

Maintenance & Community

No specific community channels or notable contributors are mentioned in the README.

Licensing & Compatibility

The project's licensing is not explicitly stated in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The README notes that using multiple reference audio inputs for multi-character mixing can produce an unstable voice timbre in the output. The VLLM_USE_V1=0 flag is mandatory for running the provided scripts, indicating potential compatibility issues with the vLLM v1 engine.

Health Check
Last commit

2 weeks ago

Responsiveness

Inactive

Pull Requests (30d)
2
Issues (30d)
18
Star History
358 stars in the last 90 days
