Long-form conversational text-to-speech synthesis
Top 72.7% on SourcePulse
VibeVoice is a novel framework for generating expressive, long-form, multi-speaker conversational audio from text. It addresses the scalability and speaker-consistency challenges of traditional TTS, synthesizing up to 90 minutes of high-fidelity audio with multiple speakers by combining ultra-low frame rate tokenization with a diffusion model.
How It Works
The core innovation lies in continuous speech tokenizers (Acoustic, Semantic) operating at 7.5 Hz, preserving fidelity while boosting efficiency for long sequences. A next-token diffusion framework integrates an LLM for context and a diffusion head for acoustic details, enabling synthesis of up to 90-minute audio with up to 4 speakers.
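To make the efficiency gain concrete, the arithmetic below (derived only from the figures above) shows how few tokens a 7.5 Hz tokenizer produces over a full 90-minute session; the 50 Hz comparison rate is a hypothetical higher-rate codec, not a figure from the project:

```shell
# Tokens per tokenizer stream for 90 minutes of audio at 7.5 Hz:
awk 'BEGIN { print 7.5 * 90 * 60 }'   # 40500 tokens
# The same duration at a hypothetical 50 Hz codec rate, for comparison:
awk 'BEGIN { print 50 * 90 * 60 }'    # 270000 tokens
```

Keeping sequences this short is what makes feeding 90 minutes of multi-speaker context through an LLM tractable.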
Quick Start & Requirements
Installation uses Docker with an NVIDIA PyTorch container (e.g., nvcr.io/nvidia/pytorch:24.07-py3), which may still require a manual flash-attention setup, or cloning the repository (git clone https://github.com/vibevoice-community/VibeVoice.git) followed by pip install -e . An NVIDIA GPU with CUDA is required. A Colab notebook is available for the 1.5B model.
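Putting the steps above together, a minimal local setup might look like the following sketch; the container tag and repository URL come from the text above, while the exact flash-attention step varies by environment:

```shell
# Option A: run inside the NVIDIA PyTorch container
# (flash-attention may still need to be built manually inside it)
docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:24.07-py3

# Option B: clone the community repo and install in editable mode
git clone https://github.com/vibevoice-community/VibeVoice.git
cd VibeVoice
pip install -e .
```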
Highlighted Details
Maintenance & Community
This is a community-maintained fork established after the official repository's removal. A Discord server (https://discord.gg/ZDEYTTRxWG) facilitates community interaction. Upcoming features include unofficial training code.
Licensing & Compatibility
The specific license is not detailed. The project is intended solely for research and development, with explicit advisories against commercial or real-world application use without further testing.
Limitations & Caveats
Potential instability exists with Chinese speech; English punctuation and the Large model are recommended. Spontaneous background music generation is an emergent feature. Singing capability is likewise emergent and may be off-key. Cross-lingual transfer can be unstable. The model inherits biases from its base LLM (Qwen2.5-1.5B). There is high potential for misuse in deepfakes and disinformation. Support is limited to English and Chinese; other languages may yield unexpected outputs. Overlapping speech is not supported.