Long-form conversational text-to-speech synthesis
Top 72.7% on SourcePulse
VibeVoice is a novel framework for generating expressive, long-form, multi-speaker conversational audio from text. It addresses the scalability and speaker-consistency challenges of traditional TTS, synthesizing up to 90 minutes of high-fidelity audio with multiple speakers by combining ultra-low frame rate tokenization with a diffusion model.
How It Works
The core innovation lies in continuous speech tokenizers (Acoustic, Semantic) operating at 7.5 Hz, preserving fidelity while boosting efficiency for long sequences. A next-token diffusion framework integrates an LLM for context and a diffusion head for acoustic details, enabling synthesis of up to 90-minute audio with up to 4 speakers.
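To make the efficiency gain concrete, the arithmetic below (derived only from the figures above) shows how few tokens a 7.5 Hz tokenizer produces over a full 90-minute session; the 50 Hz comparison rate is a hypothetical higher-rate codec, not a figure from the project:

```shell
# Tokens per tokenizer stream for 90 minutes of audio at 7.5 Hz:
awk 'BEGIN { print 7.5 * 90 * 60 }'   # 40500 tokens
# The same duration at a hypothetical 50 Hz codec rate, for comparison:
awk 'BEGIN { print 50 * 90 * 60 }'    # 270000 tokens
```

Keeping sequences this short is what makes feeding 90 minutes of multi-speaker context through an LLM tractable.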
Quick Start & Requirements
Installation uses Docker with an NVIDIA PyTorch container (e.g., nvcr.io/nvidia/pytorch:24.07-py3), which may still require a manual flash-attention setup, or cloning the repository (git clone https://github.com/vibevoice-community/VibeVoice.git) followed by pip install -e . An NVIDIA GPU with CUDA is required. A Colab notebook is available for the 1.5B model.
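Putting the steps above together, a minimal local setup might look like the following sketch; the container tag and repository URL come from the text above, while the exact flash-attention step varies by environment:

```shell
# Option A: run inside the NVIDIA PyTorch container
# (flash-attention may still need to be built manually inside it)
docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:24.07-py3

# Option B: clone the community repo and install in editable mode
git clone https://github.com/vibevoice-community/VibeVoice.git
cd VibeVoice
pip install -e .
```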
Highlighted Details
Maintenance & Community
This is a community-maintained fork established after the official repository's removal. A Discord server (https://discord.gg/ZDEYTTRxWG) facilitates community interaction. Upcoming features include unofficial training code.
Licensing & Compatibility
The specific license is not detailed. The project is intended solely for research and development, with explicit advisories against commercial or real-world application use without further testing.
Limitations & Caveats
Potential instability exists with Chinese speech; English punctuation and the Large model are recommended. Spontaneous background music generation is an emergent feature. Singing capability is likewise emergent and may be off-key. Cross-lingual transfer can be unstable. The model inherits biases from its base LLM (Qwen2.5-1.5B). There is high potential for misuse in deepfakes and disinformation. Support is limited to English and Chinese; other languages may yield unexpected outputs. Overlapping speech is not supported.