chatterbox-vllm by randombk

vLLM-accelerated TTS generation

Created 7 months ago
357 stars

Top 78.4% on SourcePulse

Project Summary

This project ports the Chatterbox Text-to-Speech (TTS) model to the vLLM inference engine, targeting users who need significantly better performance and GPU memory efficiency for high-throughput TTS generation. It offers basic speech cloning with audio and text conditioning, featuring controllable exaggeration and classifier-free guidance (CFG).

How It Works

The port leverages vLLM's optimized inference capabilities to overcome the CPU-GPU synchronization bottlenecks of the original Hugging Face Transformers implementation. It achieves substantial speedups by running the first stage of Chatterbox's two-stage generation process (T3 Llama-based token generation, followed by S3Gen waveform generation) under vLLM's PagedAttention mechanism, which enables more efficient GPU utilization and batching.
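For intuition, PagedAttention manages the KV cache in fixed-size blocks, with each sequence holding a block table that maps its logical positions to physical blocks, so memory is allocated on demand rather than reserved up front. The toy sketch below illustrates only that allocation idea; all names are illustrative and none of this is vLLM's actual implementation.

```python
BLOCK_SIZE = 16  # tokens stored per KV-cache block (illustrative)

class BlockAllocator:
    """Pool of physical KV-cache blocks, handed out on demand."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def alloc(self) -> int:
        return self.free.pop()

class Sequence:
    """One generation request: a block table maps logical to physical blocks."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []
        self.num_tokens = 0

    def append_token(self) -> None:
        # A new physical block is claimed only when the current one fills up,
        # so unused capacity is never reserved in advance.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

allocator = BlockAllocator(num_blocks=64)
seq = Sequence(allocator)
for _ in range(40):
    seq.append_token()
print(len(seq.block_table))  # 3 blocks cover 40 tokens (ceil(40/16))
```

Because blocks are claimed lazily, many concurrent sequences of varying length can share one GPU-resident pool, which is what makes large-batch TTS generation memory-efficient.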

Quick Start & Requirements

  • Install via uv: `git clone https://github.com/randombk/chatterbox-vllm.git && cd chatterbox-vllm && uv venv && source .venv/bin/activate && uv sync`.
  • Prerequisites: git and uv, plus a compatible vLLM version (tested with 0.9.2).
  • Setup time: Minimal, model weights are downloaded automatically.
  • Example: python example-tts.py
  • Benchmarking: benchmark.py

Highlighted Details

  • Achieves ~4x speedup without batching and over 10x with batching compared to the original implementation.
  • Implements classifier-free guidance (CFG) and exaggeration control.
  • Uses internal vLLM APIs and workarounds, requiring specific vLLM versions.
  • Benchmarks show significant speedups on an RTX 3090 (24GB VRAM) and an RTX 3060 Ti (8GB VRAM).
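The CFG highlighted above is the standard classifier-free guidance trick: the model is evaluated with and without conditioning, and the two logit sets are blended by a guidance weight. The sketch below shows the usual formulation only; it is illustrative and not code from this repository.

```python
def cfg_blend(cond_logits, uncond_logits, weight):
    """Classifier-free guidance: uncond + weight * (cond - uncond).

    weight = 0 ignores the conditioning entirely; weight = 1 uses the
    conditional logits unchanged; weight > 1 amplifies the conditioning.
    """
    return [u + weight * (c - u) for c, u in zip(cond_logits, uncond_logits)]

cond = [2.0, 0.5, -1.0]    # logits with the audio/text conditioning
uncond = [1.0, 1.0, 0.0]   # logits without conditioning
print(cfg_blend(cond, uncond, 0.0))  # [1.0, 1.0, 0.0]  (pure unconditional)
print(cfg_blend(cond, uncond, 1.0))  # [2.0, 0.5, -1.0] (pure conditional)
```

Intermediate weights trade off fidelity to the reference voice against naturalness, which is why CFG is exposed as a tunable knob alongside exaggeration.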

Maintenance & Community

This is a personal project. Updates and discussions regarding vLLM limitations can be followed at https://github.com/vllm-project/vllm/issues/21989.

Licensing & Compatibility

The repository does not explicitly state a license. The original Chatterbox project is Apache 2.0 licensed. Compatibility for commercial or closed-source use is not specified.

Limitations & Caveats

The project relies on internal vLLM APIs and "hacky workarounds," making it dependent on specific vLLM versions (tested with 0.9.2) and potentially unstable. Learned speech positional embeddings are not yet applied, though quality degradation is reportedly minimal. A server API is out of scope, and the project's APIs are not yet stable.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 3
  • Star history: 13 stars in the last 30 days
