Frontier Text-to-Speech for long conversations
VibeVoice is a novel framework for generating expressive, long-form, multi-speaker conversational audio from text, targeting applications like podcast creation. It addresses limitations in traditional Text-to-Speech (TTS) systems by enabling longer audio generation, maintaining speaker consistency, and facilitating natural turn-taking in conversations with multiple speakers.
How It Works
VibeVoice utilizes continuous speech tokenizers (Acoustic and Semantic) operating at an ultra-low frame rate (7.5 Hz) for efficient, high-fidelity audio processing. It employs a next-token diffusion framework, integrating a Large Language Model (LLM) for contextual understanding and dialogue flow, and a diffusion head for generating detailed acoustic features. This approach allows for the synthesis of up to 90 minutes of audio with up to four distinct speakers.
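The figures above imply a fairly short sequence even for very long audio. The following is a minimal arithmetic sketch, not VibeVoice code: it multiplies the stated 7.5 Hz frame rate by the stated 90-minute maximum. The function name `frames_for_duration` and the 50 Hz comparison rate are illustrative assumptions that do not come from the project, and the sketch counts frames only, without assuming how frames map to LLM tokens.

```python
# Frame-count arithmetic for the figures quoted above (illustrative only).

FRAME_RATE_HZ = 7.5   # tokenizer frame rate stated in the text
MAX_MINUTES = 90      # maximum audio length stated in the text

# Assumed comparison rate, NOT from the source; chosen only to show the scale.
HYPOTHETICAL_RATE_HZ = 50.0


def frames_for_duration(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Return the number of tokenizer frames covering `minutes` of audio."""
    return round(minutes * 60 * frame_rate_hz)


max_frames = frames_for_duration(MAX_MINUTES)
comparison_frames = frames_for_duration(MAX_MINUTES, HYPOTHETICAL_RATE_HZ)

print(max_frames)         # 40500
print(comparison_frames)  # 270000
```

At 7.5 Hz, 90 minutes of audio is 40,500 frames, whereas the hypothetical 50 Hz rate would need 270,000 frames for the same duration.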
Quick Start & Requirements
One option is to run inside an NVIDIA PyTorch container (e.g. nvcr.io/nvidia/pytorch:24.07-py3). Manual installation of flash attention may be required if the image does not include it. Alternatively, clone the repository and install via pip install -e .
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
VibeVoice is intended for research and development purposes only and is not recommended for commercial or real-world applications without further testing. It has a high potential for misuse in creating deepfakes and disinformation. The model currently only supports English and Chinese transcripts, does not handle non-speech audio (background noise, music), and does not generate overlapping speech. Users are responsible for ensuring transcript reliability, content accuracy, and lawful deployment. Disclosure of AI-generated content is best practice.