chatterbox-vllm by randombk

vLLM-accelerated TTS generation

Created 7 months ago
357 stars

Top 78.4% on SourcePulse

Project Summary

This project ports the Chatterbox Text-to-Speech (TTS) model to the vLLM inference engine, targeting users who need significantly better performance and GPU memory efficiency for high-throughput TTS generation. It offers basic speech cloning with audio and text conditioning, featuring controllable exaggeration and classifier-free guidance (CFG).

How It Works

The port leverages vLLM's optimized inference capabilities to overcome the CPU-GPU synchronization bottlenecks of the original Hugging Face Transformers implementation. It achieves substantial speedups by running the first stage of Chatterbox's two-stage generation process (T3 Llama-based token generation, followed by S3Gen waveform generation) under vLLM's PagedAttention mechanism, which enables more efficient GPU utilization and batching.
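For intuition, PagedAttention manages the KV cache in fixed-size blocks, with each sequence holding a block table that maps its logical positions to physical blocks, so memory is allocated on demand rather than reserved up front. The toy sketch below illustrates only that allocation idea; all names are illustrative and none of this is vLLM's actual implementation.

```python
BLOCK_SIZE = 16  # tokens stored per KV-cache block (illustrative)

class BlockAllocator:
    """Pool of physical KV-cache blocks, handed out on demand."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def alloc(self) -> int:
        return self.free.pop()

class Sequence:
    """One generation request: a block table maps logical to physical blocks."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []
        self.num_tokens = 0

    def append_token(self) -> None:
        # A new physical block is claimed only when the current one fills up,
        # so unused capacity is never reserved in advance.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

allocator = BlockAllocator(num_blocks=64)
seq = Sequence(allocator)
for _ in range(40):
    seq.append_token()
print(len(seq.block_table))  # 3 blocks cover 40 tokens (ceil(40/16))
```

Because blocks are claimed lazily, many concurrent sequences of varying length can share one GPU-resident pool, which is what makes large-batch TTS generation memory-efficient.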

Quick Start & Requirements

  • Install via uv: `git clone https://github.com/randombk/chatterbox-vllm.git && cd chatterbox-vllm && uv venv && source .venv/bin/activate && uv sync`.
  • Prerequisites: git and uv, plus a compatible vLLM version (tested with 0.9.2).
  • Setup time: Minimal, model weights are downloaded automatically.
  • Example: python example-tts.py
  • Benchmarking: benchmark.py

Highlighted Details

  • Achieves ~4x speedup without batching and over 10x with batching compared to the original implementation.
  • Implements classifier-free guidance (CFG) and exaggeration control.
  • Uses internal vLLM APIs and workarounds, requiring specific vLLM versions.
  • Benchmarks show significant speedups on an RTX 3090 (24GB VRAM) and an RTX 3060 Ti (8GB VRAM).
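The CFG highlighted above is the standard classifier-free guidance trick: the model is evaluated with and without conditioning, and the two logit sets are blended by a guidance weight. The sketch below shows the usual formulation only; it is illustrative and not code from this repository.

```python
def cfg_blend(cond_logits, uncond_logits, weight):
    """Classifier-free guidance: uncond + weight * (cond - uncond).

    weight = 0 ignores the conditioning entirely; weight = 1 uses the
    conditional logits unchanged; weight > 1 amplifies the conditioning.
    """
    return [u + weight * (c - u) for c, u in zip(cond_logits, uncond_logits)]

cond = [2.0, 0.5, -1.0]    # logits with the audio/text conditioning
uncond = [1.0, 1.0, 0.0]   # logits without conditioning
print(cfg_blend(cond, uncond, 0.0))  # [1.0, 1.0, 0.0]  (pure unconditional)
print(cfg_blend(cond, uncond, 1.0))  # [2.0, 0.5, -1.0] (pure conditional)
```

Intermediate weights trade off fidelity to the reference voice against naturalness, which is why CFG is exposed as a tunable knob alongside exaggeration.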

Maintenance & Community

This is a personal project. Updates and discussions regarding vLLM limitations can be followed at https://github.com/vllm-project/vllm/issues/21989.

Licensing & Compatibility

The repository does not explicitly state a license. The original Chatterbox project is Apache 2.0 licensed. Compatibility for commercial or closed-source use is not specified.

Limitations & Caveats

The project relies on internal vLLM APIs and "hacky workarounds," making it dependent on specific vLLM versions (tested with 0.9.2) and potentially unstable. Learned speech positional embeddings are not yet applied, though quality degradation is reportedly minimal. A server API is out of scope, and the project's APIs are not yet stable.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 3
  • Star history: 13 stars in the last 30 days
